Website
You can connect a website you own as a data source for your AI Search instance. AI Search crawls and indexes the pages automatically.
You can only crawl domains that you have onboarded onto the same Cloudflare account. Refer to Onboard a domain for more information on adding a domain to your Cloudflare account.
You can connect a website when creating a new instance through the dashboard, the REST API, or Wrangler. Website is an optional data source that you can add alongside built-in storage.
When you connect a domain, the crawler looks for your website's sitemap to determine which pages to visit:
- If you configure one or more custom sitemap URLs in the dashboard under Parser options > Specific sitemap, AI Search crawls only those sitemap URLs.
- Otherwise, the crawler checks robots.txt for listed sitemaps.
- If no robots.txt is found, the crawler checks for a sitemap at /sitemap.xml.
- If no sitemap is available, the domain cannot be crawled.
If your sitemaps include <priority> attributes, AI Search reads all sitemaps and indexes pages based on each page's priority value, regardless of which sitemap the page is in.
If no <priority> is specified, pages are indexed in the order the sitemaps are provided, either from the configured custom sitemap URLs or from robots.txt from top to bottom.
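The ordering rules above can be sketched roughly as follows. This is an illustration only: the function name and sample pages are invented, and AI Search's internal scheduler may differ in detail.

```python
# Illustrative sketch: pages with a <priority> value are indexed
# highest-first; pages without one keep their sitemap order and
# follow the prioritized pages.
def index_order(pages):
    """pages: list of (url, priority_or_None) tuples in sitemap order."""
    with_priority = [p for p in pages if p[1] is not None]
    without_priority = [p for p in pages if p[1] is None]
    # Stable sort preserves sitemap order among equal priorities.
    with_priority.sort(key=lambda p: p[1], reverse=True)
    return [url for url, _ in with_priority + without_priority]

pages = [
    ("https://example.com/about", 0.5),
    ("https://example.com/", 1.0),
    ("https://example.com/news", None),  # no <priority> set
]
print(index_order(pages))
# ['https://example.com/', 'https://example.com/about', 'https://example.com/news']
```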
AI Search supports .gz compressed sitemaps. Both robots.txt and sitemaps can use partial URLs.
During scheduled or manual sync jobs, the crawler checks the `<lastmod>` attribute in your sitemap. If the date is later than the last sync date, the page is re-crawled, the updated version is stored, and the page is automatically reindexed so that your search results always reflect the latest content.
If the <lastmod> attribute is not defined, AI Search uses the <changefreq> attribute to determine how often to re-crawl the URL. If neither <lastmod> nor <changefreq> is defined, AI Search automatically crawls each link once a day.
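The re-crawl decision described above can be sketched as follows. The function and the interval values chosen for each `<changefreq>` keyword are illustrative assumptions, not AI Search's actual implementation.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from <changefreq> keywords to re-crawl intervals.
CHANGEFREQ_INTERVALS = {
    "always": timedelta(0),
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}

def should_recrawl(lastmod, changefreq, last_sync, now):
    """Sketch of the decision: <lastmod> wins, then <changefreq>,
    then a once-a-day default."""
    if lastmod is not None:
        # Re-crawl only if the page changed after the last sync.
        return lastmod > last_sync
    if changefreq is not None:
        if changefreq == "never":
            return False
        return now - last_sync >= CHANGEFREQ_INTERVALS[changefreq]
    # Neither attribute defined: crawl once a day.
    return now - last_sync >= timedelta(days=1)

now = datetime(2026, 1, 20)
last_sync = datetime(2026, 1, 15)
print(should_recrawl(datetime(2026, 1, 18), None, last_sync, now))  # True
print(should_recrawl(None, "monthly", last_sync, now))              # False
```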
For instances with built-in storage, crawled pages are stored in managed storage automatically.
For older instances created before April 16, 2026, AI Search creates a dedicated R2 bucket in your account to store crawled pages. This bucket is automatically managed and is used only for content discovered by the crawler.
You can control which pages get indexed by defining include and exclude rules for URL paths. Use this to limit indexing to specific sections of your site or to exclude content you do not want searchable.
For example, to index only blog posts while excluding drafts:
- Include: `**/blog/**`
- Exclude: `**/blog/drafts/**`
Refer to Path filtering for pattern syntax, filtering behavior, and more examples.
For supported file types and size limits, refer to Data source.
You can configure parsing options during onboarding or in your instance settings under Parser options.
By default, AI Search crawls all sitemaps listed in your robots.txt in the order they appear (top to bottom). If you do not want the crawler to index everything, or if your sitemap is hosted at a non-standard path, you can configure custom sitemap URLs in the dashboard under Parser options > Specific sitemap.
When custom sitemap URLs are configured, AI Search uses those sitemap URLs instead of auto-discovering sitemaps from robots.txt or /sitemap.xml. You can add up to five sitemap URLs.
You can choose how pages are parsed during crawling:
- Static sites: Downloads the raw HTML for each page.
- Rendered sites: Loads pages with a headless browser and downloads the fully rendered version, including dynamic JavaScript content. For instances with built-in storage, Browser Run is included. For older instances, Browser Run limits and billing apply.
If your website has pages that are behind authentication or only visible to logged-in users, you can configure custom HTTP headers so that the AI Search crawler can access this protected content. You can add up to five custom HTTP headers to the requests AI Search sends when crawling your site.
To allow AI Search to crawl a site protected by Cloudflare Access, you need to create service token credentials and configure them as custom headers.
Service tokens bypass user authentication, so ensure your Access policies are configured appropriately for the content you want to index. The service token will allow the AI Search crawler to access all content covered by the Service Auth policy.
1. In Cloudflare One ↗, create a service token. Once the Client ID and Client Secret are generated, save them for the next steps. For example, they can look like:

   ```
   CF-Access-Client-Id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access
   CF-Access-Client-Secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
   ```

2. Create a policy with the following configuration:
   - Add an Include rule with Selector set to Service token.
   - In Value, select the Service Token you created in step 1.

3. Add your self-hosted application to Access with the following configuration:
   - In Access policies, click Select existing policies.
   - Select the policy you just created and select Confirm.

4. In the Cloudflare dashboard, go to the AI Search page.

5. Select Create.

6. Select Website as your data source.

7. Under Parse options, locate Extra headers and add the following two headers using your saved credentials:
   - Header 1:
     - Key: CF-Access-Client-Id
     - Value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access
   - Header 2:
     - Key: CF-Access-Client-Secret
     - Value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

8. Complete the AI Search setup process to create your search instance.
You can attach custom metadata to web pages using HTML <meta> tags. AI Search extracts metadata from the <head> section of each crawled page.
Before custom metadata can be extracted, you must define a schema in your AI Search configuration.
Add <meta> tags using either the name or property attribute:
```html
<!DOCTYPE html>
<html>
  <head>
    <meta name="title" content="Getting Started Guide" />
    <meta name="description" content="Learn how to set up the application" />
    <meta property="og:title" content="Getting Started Guide" />
    <meta property="og:image" content="https://example.com/og-image.png" />
    <meta name="category" content="documentation" />
    <meta name="version" content="2.5" />
    <meta name="is_public" content="true" />
  </head>
  <body>
    <!-- Page content -->
  </body>
</html>
```

For the following fields, AI Search knows which meta tags to extract from. You must still define these in your schema to enable extraction.
| Field | Source |
|---|---|
| title | `<meta name="title">` or `<meta property="og:title">` |
| description | `<meta name="description">` or `<meta property="og:description">` |
| image | `<meta property="og:image">` |
When both a standard meta tag and an Open Graph tag are present, the standard meta tag takes precedence.
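The precedence rule can be sketched with Python's standard-library HTML parser. The class and function names are illustrative, and this is a simplified stand-in for the crawler's extraction logic, not the actual implementation.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta> tags keyed by their name or property attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        key = a.get("name") or a.get("property")
        if key and "content" in a:
            # Tag names are matched case-insensitively.
            self.meta[key.lower()] = a["content"]

def extract_title(html):
    parser = MetaExtractor()
    parser.feed(html)
    # Standard meta tag takes precedence over the Open Graph tag.
    return parser.meta.get("title") or parser.meta.get("og:title")

html = """<head>
  <meta property="og:title" content="OG Title" />
  <meta name="title" content="Standard Title" />
</head>"""
print(extract_title(html))  # Standard Title
```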
When the crawler fetches a page:
- All `<meta>` tags with `name` or `property` attributes are parsed from the `<head>` section.
- Tag names are matched against your schema (case-insensitive).
- The `content` attribute value is cast to the configured data type.
- Extracted metadata is stored alongside the cached HTML.
- On subsequent processing, metadata flows into the vector index.
For boolean fields, the following values are accepted (case-insensitive):
| True values | False values |
|---|---|
| `true`, `1`, `yes` | `false`, `0`, `no` |
Any other value is treated as invalid and the field is omitted.
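The casting rule above can be expressed as a small sketch; the function name is illustrative, and returning `None` stands in for "field omitted":

```python
# Accepted boolean spellings, matched case-insensitively.
TRUE_VALUES = {"true", "1", "yes"}
FALSE_VALUES = {"false", "0", "no"}

def cast_bool(content):
    """Return True/False, or None when the value is invalid
    (in which case the field is omitted)."""
    v = content.strip().lower()
    if v in TRUE_VALUES:
        return True
    if v in FALSE_VALUES:
        return False
    return None

print(cast_bool("Yes"))    # True
print(cast_bool("0"))      # False
print(cast_bool("maybe"))  # None
```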
Content selectors let you control which parts of a crawled page are indexed. Each entry pairs a URL glob pattern with a CSS selector. When a page URL matches a glob pattern, only the elements matching the corresponding CSS selector — and their descendants — are extracted and converted to Markdown for indexing.
The list is ordered and the first matching path wins. If a page URL matches multiple glob patterns, only the selector from the first match is applied. Order your entries from most specific to least specific.
Without content selectors, AI Search applies a default processing pipeline that removes elements such as <header>, <footer>, and <head> before converting the remaining content to Markdown. For more details on how HTML is processed, refer to How HTML is processed.
1. Go to the AI Search ↗ page in the Cloudflare dashboard.

2. Select your AI Search instance, or select Create to create a new one with a Website data source.

3. Under the data source settings, locate the Content selectors section.

4. Select Add selector.

5. In the Path field, enter a glob pattern to match page URLs. For example, `**/blog/**`.

6. In the Selector field, enter a CSS selector to extract content from matching pages. For example, `article .post-body`.

7. To add more entries, select Add selector again. Entries are evaluated in order from top to bottom.
Content selectors are configured in the source_params.web_crawler.parse_options.content_selector field when creating or updating an AI Search instance. The field accepts an array of objects, each with a path and selector property.
```bash
curl "https://api.cloudflare.com/client/v4/accounts/{account_id}/ai-search/instances" \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "my-ai-search",
    "source": "https://example.com",
    "type": "web-crawler",
    "source_params": {
      "web_crawler": {
        "parse_options": {
          "content_selector": [
            { "path": "**/blog/**", "selector": "article .post-body" },
            { "path": "**/docs/**", "selector": "main .content" }
          ]
        }
      }
    }
  }'
```

| Field | Type | Description |
|---|---|---|
| path | string | Glob pattern to match against the full page URL. Uses the same glob syntax as path filtering: `*` matches within a segment, `**` crosses directories. Maximum 200 characters. |
| selector | string | CSS selector to extract content from pages matching the path pattern. Supports standard CSS selectors including element, class, ID, and attribute selectors. Maximum 200 characters. |
To index only the article body on blog pages and ignore navigation, sidebars, and footers:
| Path | Selector |
|---|---|
| `**/blog/**` | `article .post-body` |
To index the main content area of a documentation site:
| Path | Selector |
|---|---|
| `**/docs/**` | `main .content` |
You can define multiple entries to apply different selectors to different parts of your site. The first matching path wins, so place more specific patterns first:
| Path | Selector |
|---|---|
| `**/blog/releases/**` | `.release-notes` |
| `**/blog/**` | `article .post-body` |
| `**/docs/**` | `main .content` |
In this example, a page at https://example.com/blog/releases/v2 matches the first pattern and uses the .release-notes selector. A page at https://example.com/blog/my-post skips the first pattern and matches the second.
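First-match-wins resolution can be sketched as follows. The glob-to-regex translation mirrors the path filtering rules described above (`*` within a segment, `**` across segments); the function names are illustrative, not the crawler's implementation.

```python
import re

def glob_to_regex(pattern):
    """Translate a path glob to a regex: ** crosses segments,
    * stays within one segment."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

def pick_selector(url, entries):
    """entries: ordered list of {"path": glob, "selector": css}.
    The first matching path wins."""
    for entry in entries:
        if glob_to_regex(entry["path"]).match(url):
            return entry["selector"]
    return None  # no match: default processing pipeline applies

entries = [
    {"path": "**/blog/releases/**", "selector": ".release-notes"},
    {"path": "**/blog/**", "selector": "article .post-body"},
]
print(pick_selector("https://example.com/blog/releases/v2", entries))
# .release-notes
print(pick_selector("https://example.com/blog/my-post", entries))
# article .post-body
```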
- Path filtering: Path filtering takes priority over content selectors. Pages excluded by path filters are never crawled, so content selectors do not apply to them.
- Browser Run: Content selectors apply to the HTML that AI Search receives. For sites that render content with JavaScript, turn on Browser Run so that selectors can target the fully rendered DOM.
- Automatic re-indexing: Updating content selectors triggers a new sync job immediately, so changes are applied to all indexed pages.
| Limit | Value |
|---|---|
| Maximum content selector entries | 10 |
| Maximum path pattern length | 200 characters |
| Maximum selector length | 200 characters |
Configure your robots.txt and sitemap to help AI Search crawl your site efficiently.
The AI Search crawler uses the user agent Cloudflare-AI-Search. Your robots.txt file should reference your sitemap and allow the crawler:
```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

You can list multiple sitemaps or use a sitemap index file:

```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
Sitemap: https://example.com/sitemap.xml.gz
```

To block all other crawlers and allow only AI Search:

```
User-agent: *
Disallow: /

User-agent: Cloudflare-AI-Search
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Structure your sitemap to give AI Search the information it needs to crawl efficiently:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page</loc>
    <lastmod>2026-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/other-page</loc>
    <lastmod>2026-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```

Use these attributes to control crawling behavior:
| Attribute | Purpose | Recommendation |
|---|---|---|
| `<loc>` | URL of the page | Required. Use full or partial URLs. |
| `<lastmod>` | Last modification date | Include to enable change detection. AI Search re-crawls pages when this date changes. |
| `<changefreq>` | Expected change frequency | Use when `<lastmod>` is not available. Values: always, hourly, daily, weekly, monthly, yearly, never. |
| `<priority>` | Relative importance (0.0-1.0) | Set higher values for important pages. AI Search indexes pages in priority order. |
You can also use a Sitemap Index to bundle other, domain-specific sitemaps:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
    <lastmod>2024-08-15T10:00:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-docs.xml</loc>
    <lastmod>2024-08-10T12:00:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
```

When parsing a Sitemap Index, AI Search collects all child sitemaps and then crawls them recursively, collecting all relevant URLs present in your sitemaps.
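Expanding a Sitemap Index into its child sitemap URLs can be sketched with Python's standard-library XML parser; fetching and recursing into each child is omitted, and the function name is illustrative:

```python
import xml.etree.ElementTree as ET

# The sitemaps.org namespace used by <sitemapindex> documents.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def child_sitemaps(index_xml):
    """Return the <loc> URLs of all child sitemaps in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]

index_xml = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-blog.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-docs.xml</loc></sitemap>
</sitemapindex>"""
print(child_sitemaps(index_xml))
# ['https://www.example.com/sitemap-blog.xml', 'https://www.example.com/sitemap-docs.xml']
```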
- Include `<lastmod>` on all URLs to enable efficient change detection during syncs.
- Set `<priority>` to control indexing order. Pages with higher priority are indexed first.
- Use `<changefreq>` as a fallback when `<lastmod>` is not available.
- Use sitemap index files for large sites with multiple sitemaps.
- Compress large sitemaps using .gz format to reduce bandwidth.
- Keep sitemaps under 50 MB and 50,000 URLs per file (standard sitemap limits).
If you have Security rules configured to block bot activity, you can add a rule to allowlist the crawler bot.
1. In the Cloudflare dashboard, go to the Security rules page.

2. To create a new empty rule, select Create rule > Custom rules.

3. Enter a descriptive name for the rule in Rule name, such as `Allow AI Search`.

4. Under When incoming requests match, use the Field drop-down list to choose Bot Detection ID. For Operator, select equals. For Value, enter `122933950`.

5. Under Then take action, in the Choose action dropdown, choose Skip.

6. Under Place at, set the order of the rule in the Select order dropdown to First, so this rule is applied before subsequent rules.

7. To save and deploy your rule, select Deploy.
The regular AI Search limits apply when using the Website data source.
The crawler downloads and indexes pages only up to the maximum object limit supported for an AI Search instance, processing the first pages it visits until that limit is reached. In addition, any downloaded files that exceed the file size limit will not be indexed.
For instances with built-in storage, Browser Run and storage are included. For older instances, R2, Vectorize, and Browser Run are billed separately.