Should AI Crawlers Enter Your Site? Use This Table to Decide in 3 Minutes

Larry

You run a content site, knowledge base, or small media publication. One day, your traffic report suddenly shows a new batch of AI crawlers: some say they are for search indexing, some look like agents fetching live information for users, and some may be collecting content for model training.

On July 1, 2026, Cloudflare said it was moving Pay per crawl toward Pay per use: content should not only have a price when an AI system fetches it, but also when that content appears in an AI answer and creates value for an AI product. Cloudflare also said that, starting September 15, the default setting for new customers and new domains of existing customers will block training and agent purposes on ad-supported pages; crawlers that do not clearly separate search, training, and agent purposes will also fall under a stricter default treatment. On the surface, this is a platform policy update. But the bigger reminder for content sites is this: you can no longer treat every AI bot as the same kind of traffic.

If the only question is “Should we block AI?”, the answer usually gets stuck. A more usable question is: which kind of crawler, for what purpose, at what cost of access, and who is responsible if something goes wrong?

In this lesson:

Split AI crawlers into three access scenarios: search, agent, and training.
Use one decision table to decide whether to allow, observe, charge, or block.
Turn the policy into a first version without waiting for legal, engineering, and content teams to all be in the same room.

First, separate the categories: an AI crawler is not one thing

A crawler can be understood as “a program that automatically reads website content.” In the past, you may have mostly dealt with search engine bots: they fetched pages, built an index, and later sent readers back to your site.

The difficulty with AI crawlers is that their purposes are now splitting apart:

Search crawlers: their goal is to make your content appear in AI search or summary results.
Agent crawlers: an agent can be understood as “an AI assistant that can carry out several steps of a task on its own.” It may read your pages in real time for a user, compare prices, check specifications, or summarize answers.
Training crawlers: their goal is to include content in model training or datasets, and they may not send identifiable reader traffic back to you.

These three types of traffic create different value and risk for a website. If you put all of them under one rule, two bad outcomes appear: search exposure that should have been allowed gets blocked, while training access that should have required licensing or rate limits quietly gets through.

BMC has previously discussed why the agentic web needs machine-readable entry points and acceptance criteria. If a site does not define its access rules clearly, AI tools can only guess. You can read this lesson together with “Prepare machine-readable doors before letting agents enter your site”: that one handles “who gets through the door,” while this one handles “what the visit counts as after entering.”

An AI crawler access policy table

The table below is not meant to make your policy perfect in one pass. Its purpose is to move your “default reaction” from emotion to policy. Each row can later be translated into robots.txt, WAF rules, bot management settings, contract terms, or an internal handling process.

Crawler type	Conditions for allowing	When to charge or require licensing	Signals to block first	First measurable metric
Search indexing	Clearly states that the purpose is search or indexing; can send identifiable referral traffic; crawl frequency is no more than 2x a normal search bot	If summary pages heavily replace clicks and AI referral is below 1% of total content-page visits over 30 days, move it into commercial discussion	Opaque User-Agent, frequently changing IP sources, scanning the entire archive in a short period	Weekly AI referral, pages crawled, server cost
Agent real-time reading	Reads only public pages; does not log in, place orders, or submit forms; requests triggered by each end user can be rate-limited	If the agent needs large-scale real-time access to data, prices, inventory, or professional database content, require an API or paid plan	Attempts to bypass login walls, repeatedly triggers search pages, submits forms, or simulates user actions	Daily request peak, error rate, sensitive paths triggered
Training or dataset use	Allow only when there is existing public licensing, clear opt-in, or content that is already reusable	Original articles, paid content, databases, and research reports should require licensing or payment by default	Does not explain purpose, mixes training and search under the same bot, cannot provide deletion or opt-out methods	Words crawled, repeat crawl ratio, licensing response rate
Internal or partner bot	Has fixed IPs, a clear responsible contact, and separates test traffic from production traffic	If it exceeds the original partnership scope, such as moving from search exposure to training data, require renewed approval	No owner, no change notice, traffic spike that nobody acknowledges	Partner request volume, response time to anomaly alerts

The point of this table is to put “purpose” at the first layer instead of looking only at technical names. User-Agents can be faked, and IPs can change, so your policy should first answer: what kind of exchange does this access create between the crawler's operator and the site?

If it helps your content be found, the focus is exposure quality and cost.
If it reads data in real time for users, the focus is rate limits, permissions, and operation boundaries.
If it takes content for training, the focus is licensing, compensation, and opt-out mechanisms.
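One way to keep the table usable by both content and engineering teams is to hold it as structured data. The sketch below is a minimal illustration, not a prescribed format: the names Purpose, PolicyRow, POLICY, and lookup are hypothetical, and the row text is abbreviated from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    SEARCH = "search indexing"
    AGENT = "agent real-time reading"
    TRAINING = "training or dataset use"
    PARTNER = "internal or partner bot"


@dataclass(frozen=True)
class PolicyRow:
    allow_when: str                 # conditions for allowing
    charge_when: str                # when to charge or require licensing
    block_signals: tuple[str, ...]  # signals to block first
    first_metrics: tuple[str, ...]  # first measurable metrics


# Hypothetical encoding; wording is shortened from the policy table.
POLICY = {
    Purpose.SEARCH: PolicyRow(
        allow_when="states search purpose, sends identifiable referral, "
                   "crawls at no more than 2x a normal search bot",
        charge_when="summaries replace clicks and AI referral is below 1% "
                    "of content-page visits over 30 days",
        block_signals=("opaque User-Agent", "frequently changing IP sources",
                       "full-archive scan in a short period"),
        first_metrics=("weekly AI referral", "pages crawled", "server cost"),
    ),
    Purpose.AGENT: PolicyRow(
        allow_when="public pages only, no login, orders, or forms, "
                   "per-user requests can be rate-limited",
        charge_when="large-scale real-time access to data, prices, "
                    "inventory, or professional databases",
        block_signals=("bypasses login walls", "repeatedly triggers search pages",
                       "submits forms", "simulates user actions"),
        first_metrics=("daily request peak", "error rate",
                       "sensitive paths triggered"),
    ),
    Purpose.TRAINING: PolicyRow(
        allow_when="existing public licensing, clear opt-in, "
                   "or already reusable content",
        charge_when="original articles, paid content, databases, "
                    "and research reports",
        block_signals=("no stated purpose", "training and search under one bot",
                       "no deletion or opt-out method"),
        first_metrics=("words crawled", "repeat crawl ratio",
                       "licensing response rate"),
    ),
    Purpose.PARTNER: PolicyRow(
        allow_when="fixed IPs, a responsible contact, "
                   "test traffic separated from production",
        charge_when="use exceeds the original partnership scope",
        block_signals=("no owner", "no change notice",
                       "unacknowledged traffic spike"),
        first_metrics=("partner request volume",
                       "response time to anomaly alerts"),
    ),
}


def lookup(purpose: Purpose) -> PolicyRow:
    """Return the policy row for a crawler purpose."""
    return POLICY[purpose]
```

Keying the data by purpose rather than by bot name mirrors the table: a newly seen crawler is first assigned a purpose, and only then mapped to concrete rules.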

Run it once as a text decision tree

If your team does not yet have a crawler policy, follow this order. You do not need to write a full rulebook at the beginning.

1. Does this crawler clearly state its purpose?
- Yes: move to the next question.
- No: put it on an observation or block list until the other side can provide purpose, contact method, and opt-out method.
2. Is its purpose search, agent real-time reading, or training?
- Search: check whether it actually sends identifiable traffic back.
- Agent: check whether it only reads public data and does not trigger login, forms, or transactions.
- Training: check whether content licensing and commercial exchange are valid.
3. Does it create measurable cost?
- For example: daily request volume is more than twice the normal bot average, error rate increases, cache hit rate drops, or search pages are scanned heavily.
- If there is cost: change to rate limiting, charging, API access, or human review.
- If there is no cost: keep observing for now, but set a 30-day review.
4. Does it touch high-value content?
- For example: paid articles, member data, original research, product databases, course content, pricing pages, or inventory pages.
- If yes: do not allow training or large-scale reading for free by default.
- If no: you can use a looser policy, but still keep logs.
5. Does this rule have an owner?
- An owner is the person responsible for the final judgment and follow-through. A policy without an owner quickly becomes a block list that nobody maintains.
- Assign at least one content owner and one technical owner, and write down when the policy should be reviewed again.
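The first four questions can be sketched as a small function. This is an illustrative reading of the tree, not a finished rule engine. The names CrawlerFacts, creates_cost, and decide are hypothetical, only the request-volume cost signal is modeled, and sending a failed purpose check to human review is an assumption, since the tree says what to check but not what to do when the check fails. The owner question is organizational and is left out.

```python
from dataclasses import dataclass


@dataclass
class CrawlerFacts:
    states_purpose: bool
    # True when the purpose-specific check holds: search sends identifiable
    # traffic back, an agent stays on public reads, or training has valid
    # licensing.
    purpose_check_passed: bool
    daily_requests: int
    normal_bot_daily_average: float
    touches_high_value: bool


def creates_cost(facts: CrawlerFacts) -> bool:
    """Volume signal only: more than twice the normal bot average."""
    return facts.daily_requests > 2 * facts.normal_bot_daily_average


def decide(facts: CrawlerFacts) -> list[str]:
    """Walk the questions in order and collect the resulting actions."""
    if not facts.states_purpose:
        return ["observe or block until purpose, contact, and opt-out are provided"]

    actions = []
    if not facts.purpose_check_passed:
        # Assumption: the tree does not name an outcome for a failed check.
        actions.append("send to human review")

    if creates_cost(facts):
        actions.append("rate limit, charge, route to API, or human review")
    else:
        actions.append("keep observing; review in 30 days")

    if facts.touches_high_value:
        actions.append("no free training or large-scale reading by default")
    else:
        actions.append("looser policy, keep logs")
    return actions
```

For example, a crawler that states its purpose, passes its check, makes 2,500 requests a day against a normal average of 1,000, and touches /research/ would come back with both the cost action and the high-value action, which is a prompt for a conversation rather than an automatic block.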

The tree is simple to use: do not rush to choose “open everything” or “block everything.” First put the crawler into the right scenario, then decide whether to allow, observe, charge, or block.

A first-version policy for small content sites

If you only have a small team and no dedicated legal or infrastructure staff, start with four actions.

1. Write the default stance for the three purposes

Use one sentence for each:

Search indexing: allowed, but the source must be identifiable and frequency must be controlled.
Agent real-time reading: may read public pages, but must not perform sensitive actions for users.
Training use: original content requires explicit authorization by default.

These three sentences can stay in an internal document at first. They do not have to be published immediately. But once the team has a shared stance, a new bot no longer restarts the same argument every time one appears.

2. List your high-value paths

Do not begin with sitewide rules. Start by listing high-value paths:

/members/: member content.
/courses/: course pages.
/pricing/: pricing and plans.
/research/: original research or databases.
Internal search pages, filter pages, and API-like query pages.

An API can be understood as “an interface for systems to exchange data with other systems.” If some pages already behave like APIs because they are being queried at high volume, they should be handled with APIs, rate limits, or licensing instead of letting crawlers freely scan pages.
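A path list like this is easy to turn into a small classifier that logs or rules can share. The sketch below is one possible shape, under stated assumptions: the four high-value prefixes come from the list above, while the query-like prefixes (/search/, /filter/, /api/) and the name classify_path are hypothetical placeholders for whatever your site actually uses.

```python
from urllib.parse import urlsplit

# From the high-value path list above.
HIGH_VALUE_PREFIXES = ("/members/", "/courses/", "/pricing/", "/research/")
# Hypothetical examples of internal search, filter, and API-like paths.
QUERY_LIKE_PREFIXES = ("/search/", "/filter/", "/api/")


def classify_path(url: str) -> str:
    """Label a requested URL as high_value, query_like, or public."""
    path = urlsplit(url).path or "/"
    # Add a trailing slash so "/search" matches "/search/" while
    # "/searching-tips" does not.
    normalized = path if path.endswith("/") else path + "/"
    if normalized.startswith(HIGH_VALUE_PREFIXES):
        return "high_value"
    if normalized.startswith(QUERY_LIKE_PREFIXES):
        return "query_like"
    return "public"
```

Checking high-value prefixes first means a path that is both paid and query-heavy is treated under the stricter label.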

3. Separate “content is visible” from “content can be taken at scale”

A public page means human readers can view it. It does not mean machines can fetch it in unlimited volume. This distinction should be written into the policy.

You can use this sentence as an internal principle:

Public reading is not the same as automated large-scale reuse; large-scale reuse requires a stated purpose, frequency limits, and separate authorization for high-value content.

This sentence also helps align content and engineering teams. The content team cares about value and licensing. The engineering team cares about traffic and stability. The same principle connects both sides.

4. Set a 30-day review point

Do not make the first policy permanent. Set a review every 30 days and look at these five numbers:

Total AI crawler requests.
Purpose categories of the top 10 crawlers.
AI referral or identifiable return traffic.
Crawl counts on high-value paths.
Error rate, cost, or support issues caused by crawlers.

If a crawler brings identifiable readers, has low cost, and states its purpose clearly, you can keep allowing it. If it consumes large resources, has an unclear purpose, and touches high-value content, move it to rate limiting, a licensing discussion, or blocking instead of relying on default goodwill.
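Most of these numbers can be computed from access logs once each request has been tagged with a crawler name and a purpose. The sketch below assumes that tagging has already happened and that records are plain dictionaries. The function name review, the record keys, and the choice to count any status of 400 or above as an error are all assumptions for illustration. Cost and support issues are not derivable from logs alone and are left out.

```python
from collections import Counter

HIGH_VALUE_PREFIXES = ("/members/", "/courses/", "/pricing/", "/research/")


def review(hits, ai_referral_visits, content_page_visits, top_n=10):
    """Summarize one review window of tagged crawler requests.

    hits: iterable of dicts with keys crawler, purpose, path, status.
    ai_referral_visits / content_page_visits: counts from site analytics.
    """
    hits = list(hits)
    per_crawler = Counter((h["crawler"], h["purpose"]) for h in hits)
    high_value = Counter(
        prefix
        for h in hits
        for prefix in HIGH_VALUE_PREFIXES
        if h["path"].startswith(prefix)
    )
    # Assumption: any 4xx or 5xx response counts as an error.
    errors = sum(1 for h in hits if h["status"] >= 400)
    referral_share = (
        ai_referral_visits / content_page_visits if content_page_visits else 0.0
    )
    return {
        "total_requests": len(hits),
        "top_crawlers": per_crawler.most_common(top_n),
        "ai_referral_share": referral_share,
        "high_value_hits": dict(high_value),
        "error_rate": errors / len(hits) if hits else 0.0,
    }
```

The ai_referral_share value can be compared against the 1% line used in the policy table for search crawlers: 30 AI-referred visits out of 4,000 content-page visits is 0.75%, which would fall below it.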

Common mistake: putting all AI traffic behind one switch

The easiest way to create trouble is to see “AI crawler” in the admin panel and turn on one global switch. That mixes three different questions:

Exposure question: do I want AI search to find me?
Operation question: can I let agents read pages, query data, or walk through processes for users?
Licensing question: can my content be used for training or dataset building?

The answers to these three questions may be completely different. You may be willing to let search crawlers read public articles because they may bring readers. You may also be willing to let agents read FAQ pages, but not log in to member areas or submit forms. At the same time, you may require training crawlers to negotiate licensing first.

If your site is already handling AI content noise, you can also refer to “Use information entry points and source rules to handle AI-generated content noise.” Entry rules and crawler policies work as a pair: the former decides what content you accept, and the latter decides how your content can be used by machines.

When should you charge?

Charging is not necessary for every site, and it may not be immediately practical. A more realistic threshold is this: when the crawler’s purpose has gone beyond “helping you be found” and has started consuming your content assets or infrastructure, it should enter a charging or licensing discussion.

Use three questions:

Is the other side using your content to build value in its own product?
- For example: summarized answers, training data, database reconstruction, or real-time Q&A.
Are you receiving an equivalent return?
- For example: identifiable traffic, brand exposure, licensing fees, partnership data, or API usage fees.
Are you carrying extra cost?
- For example: server load, cache pressure, content substitution, support confusion, or licensing risk.

If at least two of the three answers go against you (“yes” to the first or third question, “no” to the second), do not treat it as ordinary bot traffic. Your options then include rate limiting, requiring registration, routing through an API, commercial licensing, paid crawl access, or blocking until the other side explains its purpose.
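The three questions reduce to a simple count. The sketch below is one interpretation, with a hypothetical function name: because receiving an equivalent return is good for the site, it treats a missing return as the warning signal for the second question, and flags the crawler when at least two signals are present.

```python
def needs_commercial_discussion(
    builds_product_value: bool,
    equivalent_return: bool,
    extra_cost: bool,
) -> bool:
    """Return True when at least two of the three signals go against the site."""
    # The second question is inverted: a missing return is the signal.
    signals = [builds_product_value, not equivalent_return, extra_cost]
    return sum(signals) >= 2
```

A crawler that builds product value from your content while also sending equivalent traffic back, at no extra cost, raises one signal only and stays in the ordinary category.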

The first step you can take today

If you can do only one thing today, do not start by chasing a complete tool list. Open a document named “AI crawler access policy v0.1” and write four sections:

The conditions under which we allow search-indexing crawlers.
The boundaries under which we allow agent real-time reading.
Our default licensing stance toward training crawlers.
Which paths count as high-value content and require extra review.

Then assign owners and a 30-day review date. This document can be short, but it gives later technical settings a basis: robots.txt, bot management, WAF, API keys, paywalls, and contract terms are all different execution layers for the same policy.
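As one example of an execution layer, a draft robots.txt can be checked against the policy before it is published. The sketch below uses Python's standard urllib.robotparser to do that. The bot names ExampleTrainingBot, ExampleAgentBot, and ExampleSearchBot are hypothetical, as are the /search path and the allowed helper. Note that robots.txt is a request to well-behaved crawlers, not an enforcement mechanism, so blocking signals still need WAF or bot management rules.

```python
from urllib.robotparser import RobotFileParser

# Draft rules: hypothetical bot names standing in for real User-Agents.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: ExampleAgentBot
Disallow: /members/
Disallow: /courses/
Disallow: /search

User-agent: *
Disallow: /members/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """Return whether the draft robots.txt lets this agent fetch the path."""
    return parser.can_fetch(user_agent, path)


# Spot checks that the draft matches the written stance.
checks = {
    ("ExampleTrainingBot", "/articles/a"): False,  # training needs licensing first
    ("ExampleAgentBot", "/articles/a"): True,      # agents may read public pages
    ("ExampleAgentBot", "/members/area"): False,   # but not member content
    ("ExampleSearchBot", "/articles/a"): True,     # search falls under the * group
}
for (agent, path), expected in checks.items():
    assert allowed(agent, path) is expected, (agent, path)
```

Keeping the spot checks next to the draft makes the policy document the source of truth: when a stance changes in the v0.1 document, the expected values change first and the rules follow.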

The real difficulty is not how Cloudflare or any single platform changes its rules. The difficulty is that content sites have long treated “readable” and “available for machine use” as the same thing. What is worth adding now is an access policy that can be maintained over time.

Everyday four-panel comic

[Image: four-panel comic showing a content team separating AI crawlers into search, agent, and training access rules]

A content team sees a wave of new AI crawlers and does not rush to open or block everything.
They separate the visits into search exposure, agent real-time reading, and training-data use.
Each purpose gets a boundary: allow, charge, block, or send to human review.
The team turns the choices into a policy table so the next crawler can be judged from shared rules.

AI handoff card

You can give the following prompt to AI and ask it to produce a first crawler policy. Before using it, replace the bracketed parts with your site information.

You are a technical and content policy consultant for a content site. Based on the information below, help me draft an “AI crawler access policy v0.1”.

Site type: [for example: small media site, knowledge base, course site, product documentation, database]
High-value content paths: [list URL paths, such as /members/, /courses/, /research/]
Known AI crawlers today: [list names, User-Agents, and traffic overview; write unknown if you do not know]
Returns we want: [search exposure, referral, licensing fees, API usage, partnership data]
Risks we worry about most: [server cost, content substitution, training use, member content leakage, form abuse]

Please output:
1. Conditions for allowing search-indexing crawlers.
2. Allowed and forbidden behaviors for agent real-time reading crawlers.
3. Licensing stance for training or dataset crawlers.
4. Signals that must be blocked first or sent to human review.
5. Five metrics to review after 30 days.

Constraints:
- Do not write only abstract principles. Every item must have executable conditions.
- Distinguish public reading from automated large-scale reuse.
- If information is insufficient, list the questions that must be asked of the site owner.

The purpose of this handoff card is not to make a legal judgment for you. It is to arrange questions scattered across content, engineering, and business into one shared list. Once the list takes shape, you can tell which parts should be blocked with tools, which parts should be negotiated through licensing, and which parts can safely remain open.

References

Cloudflare Press Release: Cloudflare Allows the Agentic Internet to Flourish with a Simple Philosophy: Your Content, Your Rules — https://www.cloudflare.com/press/press-releases/2026/cloudflare-allows-the-agentic-internet-to-flourish-with-a-simple-philosophy-your-content-your-rules/ (2026-07-01)

TechCrunch: Cloudflare’s new policy pushes AI companies to pay for publishers’ content — https://techcrunch.com/2026/07/01/cloudflares-new-policy-pushes-ai-companies-to-pay-for-publishers-content/ (2026-07-01)

Help Net Security: Cloudflare changes AI crawler access rules — https://www.helpnetsecurity.com/2026/07/02/cloudflare-ai-crawler-controls/ (2026-07-02)

Cloudflare Blog: Introducing pay per crawl — https://blog.cloudflare.com/introducing-pay-per-crawl/ (2025-07-01, background on the per-crawl mechanism)