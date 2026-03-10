/crawl - Crawl web content
The /crawl endpoint scrapes content from a starting URL and follows links across the site, up to a configurable depth or page limit. Responses can be returned as HTML, Markdown, or JSON.
Required parameter: url (string), the starting URL for the crawl.
Refer to optional parameters for additional customization options.
- Building knowledge bases or training AI systems (such as RAG applications) with up-to-date web content
- Scraping and analyzing content across multiple pages for research, summarization, or monitoring
There are two steps to using the /crawl endpoint:
- Initiate the crawl job — A POST request where you initiate the crawl and receive a response with a job id.
- Request results of the crawl job — A GET request where you request the status or results of the crawl.
Crawl jobs have a maximum run time of seven days. If a job does not finish within this time, it will be cancelled due to timeout. Job results are available for 14 days after the job completes, after which the job data is deleted.
Send a POST request with a url to start a crawl job. The API responds immediately with a job id you will use to retrieve results. Refer to optional parameters for additional customization options.
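As a rough illustration of the initiation step, the sketch below builds the POST request with Python's standard library without sending it. The base URL, the `browser-rendering/crawl` path, and bearer-token authentication are assumptions not stated on this page; `ACCOUNT_ID` and `API_TOKEN` are placeholders. Confirm all of them against the API docs.

```python
import json
import urllib.request

# Assumed API base; verify the exact path in the API docs.
API_BASE = "https://api.cloudflare.com/client/v4/accounts"


def build_crawl_request(account_id: str, api_token: str, url: str, **params) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a crawl job."""
    body = {"url": url, **params}  # url is required; everything else is optional
    return urllib.request.Request(
        f"{API_BASE}/{account_id}/browser-rendering/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_crawl_request("ACCOUNT_ID", "API_TOKEN", "https://example.com/docs", limit=50)
# To send it: urllib.request.urlopen(request), then read the job id from the JSON response.
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any crawl is started.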
Example response:
To check the status or request the results of your crawl job, use the job id you received:
The response includes a status field indicating the current state of the crawl job. The possible job statuses are:
- running — The crawl job is currently in progress.
- cancelled_due_to_timeout — The crawl job exceeded the maximum run time of seven days.
- cancelled_due_to_limits — The crawl job was cancelled because it hit account limits.
- cancelled_by_user — The crawl job was manually cancelled by the user.
- errored — The crawl job encountered an error.
- completed — The crawl job finished successfully.
Since crawl jobs run asynchronously, you can poll the endpoint periodically to check when the job finishes. Add ?limit=1 to the request URL so the response stays lightweight; you only need the job status, not the full set of crawled records.
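The polling pattern described above can be sketched as a small loop. Here `fetch_status` is a hypothetical caller-supplied function that performs the GET request with ?limit=1 and returns the value of the status field. The interval and the cap on polls are illustrative choices, not documented values.

```python
import time
from typing import Callable

# Every documented job status other than "running" is treated as terminal.
TERMINAL_STATUSES = {
    "completed",
    "errored",
    "cancelled_due_to_timeout",
    "cancelled_due_to_limits",
    "cancelled_by_user",
}


def wait_for_crawl(
    fetch_status: Callable[[], str],
    interval_seconds: float = 5.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job leaves the running state, then return its final status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_seconds)
    raise TimeoutError("crawl job still running after max_polls checks")


# Usage with a stubbed fetcher that reports "running" twice, then "completed".
_responses = iter(["running", "running", "completed"])
final_status = wait_for_crawl(lambda: next(_responses), sleep=lambda _seconds: None)
print(final_status)  # prints "completed"
```

Passing `sleep` as a parameter keeps the loop testable without real delays.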
Once the job reaches a terminal status, fetch the full results without the limit parameter. You can also use the following query parameters to filter and paginate results:
- cursor — Cursor for pagination. If the response exceeds 10 MB, a cursor value will be included. Pass it as a query parameter to retrieve the next page of results.
- limit — Maximum number of records to return.
- status — Filter by URL status: queued, completed, disallowed, skipped, errored, or cancelled.
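To show how these query parameters combine, here is a minimal sketch that builds the results URL and follows cursor values until none is returned. The response shape (a `result` object holding `records` and `cursor`) is an assumption, and `fetch_json` is a hypothetical function that performs the GET request and returns the parsed JSON body.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode


def results_url(job_url: str, *, limit: Optional[int] = None,
                status: Optional[str] = None, cursor: Optional[str] = None) -> str:
    """Append only the query parameters that were actually provided."""
    candidates = {"limit": limit, "status": status, "cursor": cursor}
    query = {key: value for key, value in candidates.items() if value is not None}
    return f"{job_url}?{urlencode(query)}" if query else job_url


def iter_records(job_url: str, fetch_json: Callable[[str], dict],
                 status: Optional[str] = None) -> Iterator[dict]:
    """Yield records page by page, passing each returned cursor back to the API."""
    cursor = None
    while True:
        page = fetch_json(results_url(job_url, status=status, cursor=cursor))
        result = page.get("result", {})  # assumed response envelope
        yield from result.get("records", [])
        cursor = result.get("cursor")
        if not cursor:
            break


# Usage with canned pages standing in for real API responses.
JOB_URL = "https://api.example/crawl/JOB_ID"  # hypothetical job URL
pages = {
    JOB_URL: {"result": {"records": [{"url": "a"}], "cursor": "c1"}},
    JOB_URL + "?cursor=c1": {"result": {"records": [{"url": "b"}]}},
}
records = list(iter_records(JOB_URL, pages.__getitem__))
```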
Example with query parameters:
Example response:
To cancel a crawl job that is currently in progress, use the job id you received:
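A possible shape for the cancellation request is sketched below. This page does not state the HTTP method or the path, so the DELETE method, the base URL, and placing the job id as the final path segment are all assumptions to verify against the API docs.

```python
import urllib.request

# Assumed API base; verify the exact path in the API docs.
API_BASE = "https://api.cloudflare.com/client/v4/accounts"


def build_cancel_request(account_id: str, api_token: str, job_id: str) -> urllib.request.Request:
    """Build (but do not send) a request that cancels a running crawl job."""
    return urllib.request.Request(
        f"{API_BASE}/{account_id}/browser-rendering/crawl/{job_id}",
        headers={"Authorization": f"Bearer {api_token}"},  # assumed auth scheme
        method="DELETE",  # assumed HTTP method
    )


cancel_request = build_cancel_request("ACCOUNT_ID", "API_TOKEN", "JOB_ID")
# To send it: urllib.request.urlopen(cancel_request); success is a 200 OK response.
```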
A successful cancellation will return a 200 OK status code. The job status will be updated to cancelled, and all URLs that have been queued to be crawled will be cancelled.
The following optional parameters can be used in your crawl request, in addition to the required url parameter. For the full list, refer to the API docs.
| Optional parameter | Type | Description |
| --- | --- | --- |
| limit | Number | Maximum number of pages to crawl (default is 10, maximum is 100,000). |
| depth | Number | Maximum link depth to crawl from the starting URL (default is 100,000, maximum is 100,000). |
| source | String | Source for discovering URLs. Options are all, sitemaps, or links. Default is all. |
| formats | Array of strings | Response format (default is HTML, other options are Markdown and JSON). The JSON format leverages Workers AI by default for data extraction, which incurs usage on Workers AI. Refer to the /json endpoint to learn more, including how to use a custom model and fallbacks. |
| render | Boolean | If false, does a fast HTML fetch without executing JavaScript (default is true, learn more about render). |
| jsonOptions | Object | Only required if formats includes json. Contains prompt, response_format, and custom_ai properties (same types as the /json endpoint). |
| maxAge | Number | Maximum length of time in seconds the crawler can use a cached resource before it must re-fetch it from the origin server (default is 86,400, maximum is 604,800). Cache is served from R2 only if the URL and parameters exactly match. |
| modifiedSince | Number | Unix timestamp (in seconds) indicating to only crawl pages that were modified since this time. |
| options.includeExternalLinks | Boolean | If true, follows links to external domains (default is false). |
| options.includeSubdomains | Boolean | If true, follows links to subdomains of the starting URL (default is false). |
| options.includePatterns | Array of strings | Only visits URLs that match one of these wildcard patterns. Use * to match any characters except /, or ** to match any characters including /. |
| options.excludePatterns | Array of strings | Does not visit URLs that match any of these wildcard patterns. Use * to match any characters except /, or ** to match any characters including /. |
excludePatterns has strictly higher priority. If a URL matches an exclude rule, it is skipped, regardless of whether it matches an include rule.
- No rules — Everything is indexed.
- Exclude only — Everything is indexed except items matching the exclude patterns.
- Include only — Only items matching the include patterns are indexed; everything else is ignored.
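The precedence rules above can be modelled locally to sanity-check patterns before starting a crawl. This is a sketch of the documented semantics (* stops at /, ** crosses /, exclude wins over include), not the service's actual matcher. Matching against the full URL string is an assumption.

```python
import re
from typing import Sequence


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate * (any characters except /) and ** (any characters) into a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def should_visit(url: str, include: Sequence[str] = (), exclude: Sequence[str] = ()) -> bool:
    """Apply exclude patterns first, then include patterns if any were given."""
    if any(wildcard_to_regex(p).fullmatch(url) for p in exclude):
        return False  # exclude has strictly higher priority
    if include:
        return any(wildcard_to_regex(p).fullmatch(url) for p in include)
    return True  # no include rules: everything not excluded is visited


include = ["https://example.com/docs/**"]
exclude = ["https://example.com/docs/v1/**"]
print(should_visit("https://example.com/docs/guide/intro", include, exclude))  # True
print(should_visit("https://example.com/docs/v1/intro", include, exclude))     # False
```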
To view URLs that were discovered but skipped, query the crawl job results with status=skipped. URLs can be skipped due to includeExternalLinks, includeSubdomains, includePatterns/excludePatterns, or the modifiedSince parameter. Skipped URLs will also be visible in the dashboard in a future release.
If you use render: true, which is the default, the crawl endpoint spins up a headless browser and executes page JavaScript. If you use render: false, the crawl endpoint does a fast HTML fetch without executing JavaScript.
Use render: true when the page builds content in the browser. Use render: false when the content you need is already in the initial HTML response.
Crawls that use render: true use a headless browser and are billed under typical Browser Rendering pricing. Crawls that use render: false run on Workers instead of a headless browser. During the beta, render: false crawls are not billed. After the beta, they will be billed under Workers pricing.
Crawl only documentation pages and exclude specific sections:
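One way such a request body might look, written as a Python dict and serialized to JSON. The URLs and patterns are hypothetical, and nesting the pattern lists under an `options` object is inferred from the `options.includePatterns` naming in the parameter table.

```python
import json

docs_only_request = {
    "url": "https://example.com/docs",  # hypothetical starting URL
    "limit": 200,
    "options": {
        # Visit only URLs under /docs/ ...
        "includePatterns": ["https://example.com/docs/**"],
        # ... but skip these sections; exclude rules always win.
        "excludePatterns": [
            "https://example.com/docs/changelog/**",
            "https://example.com/docs/archive/**",
        ],
    },
}

body = json.dumps(docs_only_request, indent=2)
print(body)
```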
Extract structured product data using the json format. This leverages Workers AI by default.
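A sketch of a request body for structured extraction follows. The `formats` and `jsonOptions` names come from the parameter table. The exact shape of `response_format` (here a `json_schema` type with a `schema` property) is an assumption, so check it against the /json endpoint documentation. The URL and schema fields are hypothetical.

```python
import json

product_request = {
    "url": "https://shop.example.com/products",  # hypothetical starting URL
    "limit": 20,
    "formats": ["json"],
    "jsonOptions": {
        "prompt": "Extract the product name, price, and description from the main product section.",
        # Assumed shape; see the /json endpoint docs for the authoritative format.
        "response_format": {
            "type": "json_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "string"},
                    "description": {"type": "string"},
                },
                "required": ["name", "price"],
            },
        },
    },
}

body = json.dumps(product_request)
```

A small `limit` keeps Workers AI usage low while the prompt and schema are still being tuned.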
Fetch static HTML without rendering for faster crawling of static sites:
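A minimal sketch of such a request body. The URL is hypothetical, and the lowercase `"markdown"` format value is an assumption (this page only confirms `json` as a literal value).

```python
import json

static_site_request = {
    "url": "https://static.example.com",  # hypothetical starting URL
    "render": False,                       # plain HTML fetch, no headless browser
    "formats": ["markdown"],               # assumed spelling of the format value
}

body = json.dumps(static_site_request)
print(body)  # Python's False is serialized as JSON false
```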
Crawl pages behind HTTP authentication or with custom headers:
You can also use cookies or custom headers for token-based authentication:
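The sketch below shows what both approaches might look like as request bodies. The parameter names `authenticate`, `setExtraHTTPHeaders`, and `cookies` do not appear on this page. They are assumptions based on options commonly exposed by the Browser Rendering REST API, so confirm them in the API docs. All URLs and credentials are placeholders.

```python
import json

# Option 1: HTTP authentication (parameter name assumed).
basic_auth_request = {
    "url": "https://intranet.example.com",
    "authenticate": {"username": "USERNAME", "password": "PASSWORD"},
}

# Option 2: token-based authentication through a header or a cookie
# (parameter names assumed).
token_request = {
    "url": "https://app.example.com",
    "setExtraHTTPHeaders": {"Authorization": "Bearer SITE_TOKEN"},
    "cookies": [
        {"name": "session", "value": "SESSION_ID", "url": "https://app.example.com"}
    ],
}

bodies = [json.dumps(basic_auth_request), json.dumps(token_request)]
```

In practice, load credentials from environment variables or a secrets store rather than hard-coding them.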
Crawl single-page applications that load content dynamically:
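A hedged sketch of a request body for a JavaScript-heavy site. Only `render` is documented on this page. `gotoOptions` and `waitForSelector` are assumed parameter names based on options commonly exposed by the Browser Rendering REST API, and the URL and selector are hypothetical.

```python
import json

spa_request = {
    "url": "https://app.example.com",  # hypothetical single-page application
    "render": True,                    # the default; executes page JavaScript
    # Assumed parameters: wait for network activity to settle and for a
    # content marker to appear before the page is captured.
    "gotoOptions": {"waitUntil": "networkidle2"},
    "waitForSelector": {"selector": "#content"},
}

body = json.dumps(spa_request)
```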
Speed up crawling by blocking images and media:
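A minimal sketch using `rejectResourceTypes` with the resource types named in the troubleshooting guidance on this page. Placing it at the top level of the JSON body is an assumption, and the URL is hypothetical.

```python
import json

lean_request = {
    "url": "https://example.com",  # hypothetical starting URL
    # Skip resources that are not needed for text extraction.
    "rejectResourceTypes": ["image", "media", "font"],
}

body = json.dumps(lean_request)
```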
The crawler discovers and processes URLs in the following order (when using source: all, the default):
- Starting URL — The URL specified in your request.
- Sitemap links — URLs found in the site's sitemap.
- Page links — Links scraped from pages, if not already found in the sitemap.
Use the source parameter to customize which sources the crawler uses. The available options are:
- all — Uses both sitemaps and page links (default).
- sitemaps — Only crawls URLs found in the site's sitemap.
- links — Only crawls links found on pages, ignoring sitemaps.
The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed". For guidance on configuring robots.txt and sitemaps for sites you plan to crawl, refer to robots.txt and sitemaps.
You can change the user agent at the page level by passing userAgent as a top-level parameter in the JSON body. This is useful if the target website serves different content based on the user agent.
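For instance, a request body that sets a custom user agent might look like the sketch below. The URL and the user agent string are hypothetical.

```python
import json

custom_agent_request = {
    "url": "https://example.com",  # hypothetical starting URL
    # Top-level parameter, as described above; the value is illustrative.
    "userAgent": "ExampleDocsBot/1.0 (+https://example.com/bot)",
}

body = json.dumps(custom_agent_request)
```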
If your crawl job completes but returns an empty records array, or all URLs show skipped or disallowed status:
- robots.txt blocking — The crawler respects robots.txt rules. Check the target site's robots.txt file to verify your user agent is allowed. Blocked URLs appear with "status": "disallowed".
- Pattern filters too restrictive — Your includePatterns may not match any URLs on the site. Try crawling without patterns first to confirm URLs are discoverable, then add patterns.
- No links found — The starting URL may not contain links. Try using source: "sitemaps", increasing the depth parameter, or setting includeSubdomains or includeExternalLinks to true.
If a crawl job remains in running status for an extended period:
- Slow page loads — Pages with heavy JavaScript take longer to render. Use render: false if the content you need is in the initial HTML.
- Rate limiting — Sites with strict rate limits slow crawling. The crawler respects the robots.txt Crawl-delay directive and implements backoff. Reduce limit and run multiple smaller crawls.
- Unnecessary resources — Block resources that are not needed for content extraction using rejectResourceTypes (for example, image, media, font).
A cancelled_due_to_limits status means your account hit its browser time limit. Workers Free plan accounts are capped at 10 minutes of browser use per day. To resolve this:
- Upgrade to a Workers Paid plan for higher limits.
- Use render: false for static content to avoid consuming browser time.
- Increase maxAge to use cached results where possible.
- Reduce the limit parameter.
If the json format returns null or empty results:
- Provide a clear prompt — Be specific about what data to extract and where it appears on the page (for example, "Extract the product name, price, and description from the main product section").
- Define a response schema — Use response_format with a JSON schema to enforce the expected output structure.
- Use a custom model — If the default Workers AI model does not produce the desired results, use the custom_ai parameter to specify a different model. Refer to Using a custom model (BYO API Key) for details.
If you have questions or encounter other errors, refer to the Browser Rendering FAQ and troubleshooting guide.
