Website
You can connect a website you own as a data source for your AI Search instance. AI Search crawls and indexes the pages automatically.
You can only crawl domains that you have onboarded onto the same Cloudflare account. Refer to Onboard a domain for more information on adding a domain to your Cloudflare account.
You can connect a website when creating a new instance through the dashboard, the REST API, or Wrangler. Website is an optional data source that you can add alongside built-in storage.
When you connect a domain, the crawler looks for your website's sitemap to determine which pages to visit:
- If you configure one or more custom sitemap URLs in the dashboard under Parser options > Specific sitemap, AI Search crawls only those sitemap URLs.
- Otherwise, the crawler checks robots.txt for listed sitemaps.
- If no robots.txt is found, the crawler checks for a sitemap at /sitemap.xml.
- If no sitemap is available, the domain cannot be crawled.
If your sitemaps include <priority> attributes, AI Search reads all sitemaps and indexes pages based on each page's priority value, regardless of which sitemap the page is in.
If no <priority> is specified, pages are indexed in the order the sitemaps are provided, either from the configured custom sitemap URLs or from robots.txt from top to bottom.
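The ordering rules above can be sketched roughly as follows. This is an illustration only: the function name and sample pages are invented, and AI Search's internal scheduler may differ in detail.

```python
# Illustrative sketch: pages with a <priority> value are indexed
# highest-first; pages without one keep their sitemap order and
# follow the prioritized pages.
def index_order(pages):
    """pages: list of (url, priority_or_None) tuples in sitemap order."""
    with_priority = [p for p in pages if p[1] is not None]
    without_priority = [p for p in pages if p[1] is None]
    # Stable sort preserves sitemap order among equal priorities.
    with_priority.sort(key=lambda p: p[1], reverse=True)
    return [url for url, _ in with_priority + without_priority]

pages = [
    ("https://example.com/about", 0.5),
    ("https://example.com/", 1.0),
    ("https://example.com/news", None),  # no <priority> set
]
print(index_order(pages))
# ['https://example.com/', 'https://example.com/about', 'https://example.com/news']
```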
AI Search supports .gz compressed sitemaps. Both robots.txt and sitemaps can use partial URLs.
During scheduled or manual sync jobs, the crawler checks the `<lastmod>` attribute in your sitemap. If the date is later than the last sync date, the page is re-crawled, the updated version is stored, and the page is automatically reindexed so that your search results always reflect the latest content.
If the <lastmod> attribute is not defined, AI Search uses the <changefreq> attribute to determine how often to re-crawl the URL. If neither <lastmod> nor <changefreq> is defined, AI Search automatically crawls each link once a day.
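The re-crawl decision described above can be sketched as follows. The function and the interval values chosen for each `<changefreq>` keyword are illustrative assumptions, not AI Search's actual implementation.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from <changefreq> keywords to re-crawl intervals.
CHANGEFREQ_INTERVALS = {
    "always": timedelta(0),
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}

def should_recrawl(lastmod, changefreq, last_sync, now):
    """Sketch of the decision: <lastmod> wins, then <changefreq>,
    then a once-a-day default."""
    if lastmod is not None:
        # Re-crawl only if the page changed after the last sync.
        return lastmod > last_sync
    if changefreq is not None:
        if changefreq == "never":
            return False
        return now - last_sync >= CHANGEFREQ_INTERVALS[changefreq]
    # Neither attribute defined: crawl once a day.
    return now - last_sync >= timedelta(days=1)

now = datetime(2026, 1, 20)
last_sync = datetime(2026, 1, 15)
print(should_recrawl(datetime(2026, 1, 18), None, last_sync, now))  # True
print(should_recrawl(None, "monthly", last_sync, now))              # False
```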
For instances with built-in storage, crawled pages are stored in managed storage automatically.
For older instances created before April 16, 2026, AI Search creates a dedicated R2 bucket in your account to store crawled pages. This bucket is automatically managed and is used only for content discovered by the crawler.
You can control which pages get indexed by defining include and exclude rules for URL paths. Use this to limit indexing to specific sections of your site or to exclude content you do not want searchable.
For example, to index only blog posts while excluding drafts:
- Include: `**/blog/**`
- Exclude: `**/blog/drafts/**`
Refer to Path filtering for pattern syntax, filtering behavior, and more examples.
For supported file types and size limits, refer to Data source.
You can configure parsing options during onboarding or in your instance settings under Parser options.
By default, AI Search crawls all sitemaps listed in your robots.txt in the order they appear (top to bottom). If you do not want the crawler to index everything, or if your sitemap is hosted at a non-standard path, you can configure custom sitemap URLs in the dashboard under Parser options > Specific sitemap.
When custom sitemap URLs are configured, AI Search uses those sitemap URLs instead of auto-discovering sitemaps from robots.txt or /sitemap.xml. You can add up to five sitemap URLs.
You can choose how pages are parsed during crawling:
- Static sites: Downloads the raw HTML for each page.
- Rendered sites: Loads pages with a headless browser and downloads the fully rendered version, including dynamic JavaScript content. For instances with built-in storage, Browser Run is included. For older instances, Browser Run limits and billing apply.
If your website has pages that are behind authentication or only visible to logged-in users, you can configure custom HTTP headers so that the AI Search crawler can access this protected content. You can add up to five custom HTTP headers to the requests AI Search sends when crawling your site.
To allow AI Search to crawl a site protected by Cloudflare Access, you need to create service token credentials and configure them as custom headers.
Service tokens bypass user authentication, so ensure your Access policies are configured appropriately for the content you want to index. The service token will allow the AI Search crawler to access all content covered by the Service Auth policy.
1. In Cloudflare One ↗, create a service token. Once the Client ID and Client Secret are generated, save them for the next steps. For example, they can look like:

   ```
   CF-Access-Client-Id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access
   CF-Access-Client-Secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
   ```

2. Create a policy with the following configuration:
   - Add an Include rule with Selector set to Service token.
   - In Value, select the Service Token you created in step 1.

3. Add your self-hosted application to Access with the following configuration:
   - In Access policies, click Select existing policies.
   - Select the policy you just created and select Confirm.

4. In the Cloudflare dashboard, go to the AI Search page.

5. Select Create.

6. Select Website as your data source.

7. Under Parse options, locate Extra headers and add the following two headers using your saved credentials:
   - Header 1:
     - Key: CF-Access-Client-Id
     - Value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access
   - Header 2:
     - Key: CF-Access-Client-Secret
     - Value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

8. Complete the AI Search setup process to create your search instance.
You can attach custom metadata to web pages using HTML <meta> tags. AI Search extracts metadata from the <head> section of each crawled page.
Before custom metadata can be extracted, you must define a schema in your AI Search configuration.
Add <meta> tags using either the name or property attribute:
```html
<!DOCTYPE html>
<html>
  <head>
    <meta name="title" content="Getting Started Guide" />
    <meta name="description" content="Learn how to set up the application" />
    <meta property="og:title" content="Getting Started Guide" />
    <meta property="og:image" content="https://example.com/og-image.png" />
    <meta name="category" content="documentation" />
    <meta name="version" content="2.5" />
    <meta name="is_public" content="true" />
  </head>
  <body>
    <!-- Page content -->
  </body>
</html>
```

For the following fields, AI Search knows which meta tags to extract from. You must still define these in your schema to enable extraction.
| Field | Source |
|---|---|
| title | `<meta name="title">` or `<meta property="og:title">` |
| description | `<meta name="description">` or `<meta property="og:description">` |
| image | `<meta property="og:image">` |
When both a standard meta tag and an Open Graph tag are present, the standard meta tag takes precedence.
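The precedence rule can be sketched with Python's standard-library HTML parser. The class and function names are illustrative, and this is a simplified stand-in for the crawler's extraction logic, not the actual implementation.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta> tags keyed by their name or property attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        key = a.get("name") or a.get("property")
        if key and "content" in a:
            # Tag names are matched case-insensitively.
            self.meta[key.lower()] = a["content"]

def extract_title(html):
    parser = MetaExtractor()
    parser.feed(html)
    # Standard meta tag takes precedence over the Open Graph tag.
    return parser.meta.get("title") or parser.meta.get("og:title")

html = """<head>
  <meta property="og:title" content="OG Title" />
  <meta name="title" content="Standard Title" />
</head>"""
print(extract_title(html))  # Standard Title
```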
When the crawler fetches a page:
- All `<meta>` tags with `name` or `property` attributes are parsed from the `<head>` section.
- Tag names are matched against your schema (case-insensitive).
- The `content` attribute value is cast to the configured data type.
- Extracted metadata is stored alongside the cached HTML.
- On subsequent processing, metadata flows into the vector index.
For boolean fields, the following values are accepted (case-insensitive):
| True values | False values |
|---|---|
| `true`, `1`, `yes` | `false`, `0`, `no` |
Any other value is treated as invalid and the field is omitted.
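The casting rule above can be expressed as a small sketch; the function name is illustrative, and returning `None` stands in for "field omitted":

```python
# Accepted boolean spellings, matched case-insensitively.
TRUE_VALUES = {"true", "1", "yes"}
FALSE_VALUES = {"false", "0", "no"}

def cast_bool(content):
    """Return True/False, or None when the value is invalid
    (in which case the field is omitted)."""
    v = content.strip().lower()
    if v in TRUE_VALUES:
        return True
    if v in FALSE_VALUES:
        return False
    return None

print(cast_bool("Yes"))    # True
print(cast_bool("0"))      # False
print(cast_bool("maybe"))  # None
```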
Content selectors let you control which parts of a crawled page are indexed. Each entry pairs a URL glob pattern with a CSS selector. When a page URL matches a glob pattern, only the elements matching the corresponding CSS selector — and their descendants — are extracted and converted to Markdown for indexing.
The list is ordered and the first matching path wins. If a page URL matches multiple glob patterns, only the selector from the first match is applied. Order your entries from most specific to least specific.
Without content selectors, AI Search applies a default processing pipeline that removes elements such as <header>, <footer>, and <head> before converting the remaining content to Markdown. For more details on how HTML is processed, refer to How HTML is processed.
1. Go to the AI Search ↗ page in the Cloudflare dashboard.

2. Select your AI Search instance, or select Create to create a new one with a Website data source.

3. Under the data source settings, locate the Content selectors section.

4. Select Add selector.

5. In the Path field, enter a glob pattern to match page URLs. For example, `**/blog/**`.

6. In the Selector field, enter a CSS selector to extract content from matching pages. For example, `article .post-body`.

7. To add more entries, select Add selector again. Entries are evaluated in order from top to bottom.
Content selectors are configured in the source_params.web_crawler.parse_options.content_selector field when creating or updating an AI Search instance. The field accepts an array of objects, each with a path and selector property.
```bash
curl "https://api.cloudflare.com/client/v4/accounts/{account_id}/ai-search/instances" \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "my-ai-search",
    "source": "https://example.com",
    "type": "web-crawler",
    "source_params": {
      "web_crawler": {
        "parse_options": {
          "content_selector": [
            { "path": "**/blog/**", "selector": "article .post-body" },
            { "path": "**/docs/**", "selector": "main .content" }
          ]
        }
      }
    }
  }'
```

| Field | Type | Description |
|---|---|---|
| path | string | Glob pattern to match against the full page URL. Uses the same glob syntax as path filtering: `*` matches within a segment, `**` crosses directories. Maximum 200 characters. |
| selector | string | CSS selector to extract content from pages matching the path pattern. Supports standard CSS selectors including element, class, ID, and attribute selectors. Maximum 200 characters. |
To index only the article body on blog pages and ignore navigation, sidebars, and footers:
| Path | Selector |
|---|---|
| `**/blog/**` | `article .post-body` |
To index the main content area of a documentation site:
| Path | Selector |
|---|---|
| `**/docs/**` | `main .content` |
You can define multiple entries to apply different selectors to different parts of your site. The first matching path wins, so place more specific patterns first:
| Path | Selector |
|---|---|
| `**/blog/releases/**` | `.release-notes` |
| `**/blog/**` | `article .post-body` |
| `**/docs/**` | `main .content` |
In this example, a page at https://example.com/blog/releases/v2 matches the first pattern and uses the .release-notes selector. A page at https://example.com/blog/my-post skips the first pattern and matches the second.
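First-match-wins resolution can be sketched as follows. The glob-to-regex translation mirrors the path filtering rules described above (`*` within a segment, `**` across segments); the function names are illustrative, not the crawler's implementation.

```python
import re

def glob_to_regex(pattern):
    """Translate a path glob to a regex: ** crosses segments,
    * stays within one segment."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

def pick_selector(url, entries):
    """entries: ordered list of {"path": glob, "selector": css}.
    The first matching path wins."""
    for entry in entries:
        if glob_to_regex(entry["path"]).match(url):
            return entry["selector"]
    return None  # no match: default processing pipeline applies

entries = [
    {"path": "**/blog/releases/**", "selector": ".release-notes"},
    {"path": "**/blog/**", "selector": "article .post-body"},
]
print(pick_selector("https://example.com/blog/releases/v2", entries))
# .release-notes
print(pick_selector("https://example.com/blog/my-post", entries))
# article .post-body
```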
- Path filtering: Path filtering takes priority over content selectors. Pages excluded by path filters are never crawled, so content selectors do not apply to them.
- Browser Run: Content selectors apply to the HTML that AI Search receives. For sites that render content with JavaScript, turn on Browser Run so that selectors can target the fully rendered DOM.
- Automatic re-indexing: Updating content selectors triggers a new sync job immediately, so changes are applied to all indexed pages.
| Limit | Value |
|---|---|
| Maximum content selector entries | 10 |
| Maximum path pattern length | 200 characters |
| Maximum selector length | 200 characters |
Configure your robots.txt and sitemap to help AI Search crawl your site efficiently.
The AI Search crawler uses the user agent Cloudflare-AI-Search. Your robots.txt file should reference your sitemap and allow the crawler:
```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

You can list multiple sitemaps or use a sitemap index file:

```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
Sitemap: https://example.com/sitemap.xml.gz
```

To block all other crawlers and allow only AI Search:

```
User-agent: *
Disallow: /

User-agent: Cloudflare-AI-Search
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Structure your sitemap to give AI Search the information it needs to crawl efficiently:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page</loc>
    <lastmod>2026-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/other-page</loc>
    <lastmod>2026-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```

Use these attributes to control crawling behavior:
| Attribute | Purpose | Recommendation |
|---|---|---|
| `<loc>` | URL of the page | Required. Use full or partial URLs. |
| `<lastmod>` | Last modification date | Include to enable change detection. AI Search re-crawls pages when this date changes. |
| `<changefreq>` | Expected change frequency | Use when `<lastmod>` is not available. Values: always, hourly, daily, weekly, monthly, yearly, never. |
| `<priority>` | Relative importance (0.0-1.0) | Set higher values for important pages. AI Search indexes pages in priority order. |
You can also use a Sitemap Index to bundle other, domain-specific sitemaps:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
    <lastmod>2024-08-15T10:00:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-docs.xml</loc>
    <lastmod>2024-08-10T12:00:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
```

When parsing a Sitemap Index, AI Search collects all child sitemaps and then crawls them recursively, collecting all relevant URLs present in your sitemaps.
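Expanding a Sitemap Index into its child sitemap URLs can be sketched with Python's standard-library XML parser; fetching and recursing into each child is omitted, and the function name is illustrative:

```python
import xml.etree.ElementTree as ET

# The sitemaps.org namespace used by <sitemapindex> documents.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def child_sitemaps(index_xml):
    """Return the <loc> URLs of all child sitemaps in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]

index_xml = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-blog.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-docs.xml</loc></sitemap>
</sitemapindex>"""
print(child_sitemaps(index_xml))
# ['https://www.example.com/sitemap-blog.xml', 'https://www.example.com/sitemap-docs.xml']
```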
- Include `<lastmod>` on all URLs to enable efficient change detection during syncs.
- Set `<priority>` to control indexing order. Pages with higher priority are indexed first.
- Use `<changefreq>` as a fallback when `<lastmod>` is not available.
- Use sitemap index files for large sites with multiple sitemaps.
- Compress large sitemaps using .gz format to reduce bandwidth.
- Keep sitemaps under 50 MB and 50,000 URLs per file (standard sitemap limits).
If you have Security rules configured to block bot activity, you can add a rule to allowlist the crawler bot.
1. In the Cloudflare dashboard, go to the Security rules page.

2. To create a new empty rule, select Create rule > Custom rules.

3. Enter a descriptive name for the rule in Rule name, such as `Allow AI Search`.

4. Under When incoming requests match, use the Field drop-down list to choose Bot Detection ID. For Operator, select equals. For Value, enter `122933950`.

5. Under Then take action, in the Choose action dropdown, choose Skip.

6. Under Place at, set the order of the rule in the Select order dropdown to First, so this rule is applied before subsequent rules.

7. To save and deploy your rule, select Deploy.
The regular AI Search limits apply when using the Website data source.
The crawler downloads and indexes pages only up to the maximum object limit supported for an AI Search instance, processing the first pages it visits until that limit is reached. In addition, any downloaded files that exceed the file size limit will not be indexed.
For instances with built-in storage, Browser Run and storage are included. For older instances, R2, Vectorize, and Browser Run are billed separately.