Website

The Website data source allows you to connect a domain you own so its pages can be crawled, stored, and indexed.

You can only crawl domains that you have onboarded onto the same Cloudflare account. Refer to Onboard a domain for more information on adding a domain to your Cloudflare account.

How website crawling works

When you connect a domain, the crawler looks for your website's sitemap to determine which pages to visit:

  1. The crawler first checks robots.txt for listed sitemaps. If the file exists, the crawler reads every sitemap referenced in it.
  2. If no robots.txt is found, the crawler checks for a sitemap at /sitemap.xml.
  3. If no sitemap is available, the domain cannot be crawled.

Pages are indexed in order of the <priority> attribute set in the sitemap, if this field is defined.

AI Search supports .gz compressed sitemaps. Both robots.txt and sitemaps can use partial URLs.
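For example, a sitemap reference that uses only a path is resolved against your connected domain. The following robots.txt entry is an illustration of a partial URL; the same applies to <loc> values inside a sitemap:

robots.txt
User-agent: *
Allow: /
Sitemap: /sitemap.xml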

Path filtering

You can control which pages get indexed by defining include and exclude rules for URL paths. Use this to limit indexing to specific sections of your site or to exclude content you do not want searchable.

For example, to index only blog posts while excluding drafts:

  • Include: **/blog/**
  • Exclude: **/blog/drafts/**

Refer to Path filtering for pattern syntax, filtering behavior, and more examples.
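As a rough illustration of how such include and exclude globs behave (a simplified sketch for intuition, not AI Search's actual matcher; refer to Path filtering for the authoritative semantics), each pattern can be thought of as a regular expression tested against the URL path:

// Simplified sketch: treat "**" as "any number of path segments" and
// "*" as "anything within one segment", then test URL paths.
// Illustration only; AI Search's real matcher may differ.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000")           // placeholder for "**"
    .replace(/\*/g, "[^/]*")              // "*" stays within one segment
    .replace(/\u0000/g, ".*");            // "**" crosses segments
  return new RegExp(`^${escaped}$`);
}

const include = globToRegExp("**/blog/**");
const exclude = globToRegExp("**/blog/drafts/**");

function shouldIndex(path: string): boolean {
  return include.test(path) && !exclude.test(path);
}

shouldIndex("/blog/launch-post");      // true: matches the include rule
shouldIndex("/blog/drafts/new-idea");  // false: excluded
shouldIndex("/pricing");               // false: no include rule matches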

Best practices for robots.txt and sitemap

Configure your robots.txt and sitemap to help AI Search crawl your site efficiently.

robots.txt

The AI Search crawler uses the user agent Cloudflare-AutoRAG. Your robots.txt file should reference your sitemap and allow the crawler:

robots.txt
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

You can list multiple sitemaps or use a sitemap index file:

robots.txt
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
Sitemap: https://example.com/sitemap.xml.gz
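
A sitemap index file referenced this way groups several sitemaps into one document. A minimal example, using placeholder filenames:

sitemap-index.xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap.xml</loc>
    <lastmod>2026-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/blog-sitemap.xml</loc>
    <lastmod>2026-01-10</lastmod>
  </sitemap>
</sitemapindex>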

To block all other crawlers but allow only AI Search:

robots.txt
User-agent: *
Disallow: /
User-agent: Cloudflare-AutoRAG
Allow: /
Sitemap: https://example.com/sitemap.xml

Sitemap

Structure your sitemap to give AI Search the information it needs to crawl efficiently:

sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page</loc>
    <lastmod>2026-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/other-page</loc>
    <lastmod>2026-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>

Use these attributes to control crawling behavior:

  • <loc>: URL of the page. Required. Use full or partial URLs.
  • <lastmod>: Last modification date. Include it to enable change detection. AI Search re-crawls pages when this date changes.
  • <changefreq>: Expected change frequency. Use it when <lastmod> is not available. Values: always, hourly, daily, weekly, monthly, yearly, never.
  • <priority>: Relative importance (0.0-1.0). Set higher values for important pages. AI Search indexes pages in priority order.

Recommendations

  • Include <lastmod> on all URLs to enable efficient change detection during syncs.
  • Set <priority> to control indexing order. Pages with higher priority are indexed first.
  • Use <changefreq> as a fallback when <lastmod> is not available.
  • Use sitemap index files for large sites with multiple sitemaps.
  • Compress large sitemaps using .gz format to reduce bandwidth.
  • Keep sitemaps under 50MB and 50,000 URLs per file (standard sitemap limits).

How to set WAF rules to allowlist the crawler

If you have Security rules configured to block bot activity, you can add a rule to allowlist the crawler bot.

  1. In the Cloudflare dashboard, go to the Security rules page.

  2. To create a new empty rule, select Create rule > Custom rules.

  3. Enter a descriptive name for the rule in Rule name, such as Allow AI Search.

  4. Under When incoming requests match, use the Field drop-down list to choose Bot Detection ID. For Operator, select equals. For Value, enter 122933950.

  5. Under Then take action, in the Choose action dropdown, choose Skip.

  6. Under Place at, use the Select order dropdown to set the rule's order to First. Placing the rule first ensures it is evaluated before subsequent rules.

  7. To save and deploy your rule, select Deploy.
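
For reference, the rule created in these steps matches the crawler's bot detection ID and applies the Skip action. In the Rules language this is typically written along the following lines (a sketch based on the cf.bot_management.detection_ids field; confirm the exact expression in the dashboard's Expression Editor):

Custom rule (sketch)
Expression: any(cf.bot_management.detection_ids[*] eq 122933950)
Action: Skip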

Parsing options

You can choose how pages are parsed during crawling:

  • Static sites: Downloads the raw HTML for each page.
  • Rendered sites: Loads pages with a headless browser and downloads the fully rendered version, including dynamic JavaScript content. Note that the Browser Rendering limits and billing apply.
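
As a rough illustration of the difference between the two options above (not AI Search's internal crawler code), a static crawl is essentially a plain HTTP fetch of the HTML, while a rendered crawl runs the page in a headless browser, for example through Cloudflare's Browser Rendering Workers binding, before capturing the resulting DOM:

import puppeteer from "@cloudflare/puppeteer";

// Static: download the HTML exactly as the server returns it.
async function fetchStatic(url: string): Promise<string> {
  const res = await fetch(url);
  return await res.text();
}

// Rendered: load the page in a headless browser so client-side JavaScript
// runs, then capture the resulting DOM. Assumes a Worker with a Browser
// Rendering binding named BROWSER (a hypothetical binding name).
async function fetchRendered(env: { BROWSER: any }, url: string): Promise<string> {
  const browser = await puppeteer.launch(env.BROWSER);
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle0" });
  const html = await page.content();
  await browser.close();
  return html;
}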

Access protected content

If your website has pages that are behind authentication or only visible to logged-in users, you can configure custom HTTP headers to allow the AI Search crawler to access this protected content. You can add up to five custom HTTP headers to the requests AI Search sends when crawling your site.

Providing access to sites protected by Cloudflare Access

To allow AI Search to crawl a site protected by Cloudflare Access, you need to create service token credentials and configure them as custom headers.

Service tokens bypass user authentication, so ensure your Access policies are configured appropriately for the content you want to index. The service token will allow the AI Search crawler to access all content covered by the Service Auth policy.

  1. In Cloudflare One, create a service token. Once the Client ID and Client Secret are generated, save them for the next steps. For example, they may look like:

    CF-Access-Client-Id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access
    CF-Access-Client-Secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  2. Create a policy with the following configuration:

    • Add an Include rule with Selector set to Service token.
    • In Value, select the Service Token you created in step 1.
  3. Add your self-hosted application to Access with the following configuration:

    • In Access policies, click Select existing policies.
    • Select the policy that you have just created and select Confirm.
  4. In the Cloudflare dashboard, go to the AI Search page.

  5. Select Create.

  6. Select Website as your data source.

  7. Under Parse options, locate Extra headers and add the following two headers using your saved credentials:

    • Header 1:
      • Key: CF-Access-Client-Id
      • Value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access
    • Header 2:
      • Key: CF-Access-Client-Secret
      • Value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  8. Complete the AI Search setup process to create your search instance.
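
To confirm the service token grants access before running a sync, you can send a request with the same two headers; a protected page should return its content instead of redirecting to the Cloudflare Access login (placeholder values shown):

curl -H "CF-Access-Client-Id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access" \
     -H "CF-Access-Client-Secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
     https://example.com/protected-page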

Storage

During setup, AI Search creates a dedicated R2 bucket in your account to store the pages that have been crawled and downloaded as HTML files. This bucket is automatically managed and is used only for content discovered by the crawler. Any files or objects that you add directly to this bucket will not be indexed.

Sync and updates

During scheduled or manual sync jobs, the crawler checks the <lastmod> attribute in your sitemap. If it has changed to a date later than the last sync, the page is re-crawled, the updated version is stored in the R2 bucket, and the content is automatically reindexed so that your search results always reflect the latest content.

If the <lastmod> attribute is not defined, AI Search uses the <changefreq> attribute to determine how often to re-crawl the URL. If neither <lastmod> nor <changefreq> is defined, AI Search automatically crawls each link once a day.
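
A minimal sketch of that decision logic, assuming the behavior described above (an illustration, not AI Search's actual implementation):

// Simplified sketch of the re-crawl decision described above.
type SitemapEntry = {
  loc: string;
  lastmod?: string; // e.g. "2026-01-15"
  changefreq?: "always" | "hourly" | "daily" | "weekly" | "monthly" | "yearly" | "never";
};

// Rough interval, in milliseconds, implied by each <changefreq> value.
const CHANGEFREQ_MS: Record<string, number> = {
  always: 0,
  hourly: 60 * 60 * 1000,
  daily: 24 * 60 * 60 * 1000,
  weekly: 7 * 24 * 60 * 60 * 1000,
  monthly: 30 * 24 * 60 * 60 * 1000,
  yearly: 365 * 24 * 60 * 60 * 1000,
  never: Number.POSITIVE_INFINITY,
};

function shouldRecrawl(entry: SitemapEntry, lastSync: Date, now = new Date()): boolean {
  if (entry.lastmod) {
    // <lastmod> present: re-crawl only if the page changed after the last sync.
    return new Date(entry.lastmod) > lastSync;
  }
  if (entry.changefreq) {
    // Fall back to <changefreq>: re-crawl once the declared interval has elapsed.
    return now.getTime() - lastSync.getTime() >= CHANGEFREQ_MS[entry.changefreq];
  }
  // Neither defined: crawl roughly once a day.
  return now.getTime() - lastSync.getTime() >= CHANGEFREQ_MS.daily;
}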

Limits

The regular AI Search limits apply when using the Website data source.

The crawler downloads and indexes pages only up to the maximum object limit supported for an AI Search instance, processing pages in the order it visits them until that limit is reached. In addition, any downloaded file that exceeds the file size limit will not be indexed.