Chunking

Chunking is the process of splitting large data into smaller segments before embedding them for search. AutoRAG uses recursive chunking, which breaks your content at natural boundaries (like paragraphs or sentences) and then further splits any chunk that is still too large.

What is recursive chunking

Recursive chunking tries to keep chunks meaningful by:

  • Splitting at natural boundaries: like paragraphs, then sentences.
  • Checking the size: if a chunk is too long (based on token count), it’s split again into smaller parts.

This way, chunks are easy to embed and retrieve, without cutting off thoughts mid-sentence.
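The sketch below illustrates the general idea; it is not AutoRAG's actual implementation, and it approximates token counts with word counts for brevity. It splits on the coarsest boundary first and only recurses on pieces that are still too large.

```ts
// Illustrative recursive chunker: split at the coarsest separator first,
// then recurse with finer separators on any piece that is still too large.
// Token counting is approximated by whitespace-delimited words.
const SEPARATORS = ["\n\n", "\n", ". "]; // paragraphs, lines, sentences

function tokenCount(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

function recursiveChunk(text: string, maxTokens: number, level = 0): string[] {
  if (tokenCount(text) <= maxTokens || level >= SEPARATORS.length) {
    return text.trim() ? [text.trim()] : [];
  }
  // Split at the current natural boundary, then recurse on each piece.
  return text
    .split(SEPARATORS[level])
    .flatMap((piece) => recursiveChunk(piece, maxTokens, level + 1));
}
```

A production chunker would also merge adjacent small pieces back up toward the target chunk size and apply overlap between neighboring chunks; those steps are omitted here for clarity.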

Chunking controls

AutoRAG exposes two parameters to help you control chunking behavior:

  • Chunk size: The number of tokens per chunk.
    • Minimum: 64
    • Maximum: 512
  • Chunk overlap: The percentage of overlapping tokens between adjacent chunks.
    • Minimum: 0%
    • Maximum: 30%

These settings apply during the indexing step, before your data is embedded and stored in Vectorize.
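To see how the two parameters interact, the snippet below derives the number of shared tokens and the stride between chunk start positions. The function and parameter names are illustrative only, not an AutoRAG API.

```ts
// Illustrative only: derive overlap tokens and chunk stride from the two
// AutoRAG chunking controls. Names here are hypothetical, not an AutoRAG API.
function chunkingStats(chunkSize: number, overlapPercent: number) {
  const overlapTokens = Math.floor(chunkSize * (overlapPercent / 100));
  const stride = chunkSize - overlapTokens; // tokens advanced per new chunk
  return { overlapTokens, stride };
}

// Example: 256-token chunks with 15% overlap share ~38 tokens with their
// neighbor and advance 218 tokens at a time.
console.log(chunkingStats(256, 15)); // { overlapTokens: 38, stride: 218 }
```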

Choosing chunk size and overlap

Chunking affects both how your content is retrieved and how much context is passed into the generation model. Try out this external chunk visualizer tool to see how different chunk settings would segment your content.

For chunk size, consider how:

  • Smaller chunks create more precise vector matches, but may split relevant ideas across multiple chunks.
  • Larger chunks retain more context, but may dilute relevance and reduce retrieval precision.

For chunk overlap, consider how:

  • More overlap helps preserve continuity across boundaries, especially in flowing or narrative content.
  • Less overlap reduces indexing time and cost, but can miss context if key terms are split between chunks.

Additional considerations:

  • Vector index size: Smaller chunk sizes produce more chunks and more total vectors. Refer to the Vectorize limits to ensure your configuration stays within the maximum allowed vectors per index.
  • Generation model context window: Generation models have a limited context window that must fit all retrieved chunks (topK × chunk size), the user query, and the model’s output. Be careful with large chunks or high topK values to avoid context overflows (see the sketch after this list).
  • Cost and performance: Larger chunks and higher topK settings result in more tokens passed to the model, which can increase latency and cost. You can monitor this usage in AI Gateway.
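
As a rough sanity check on the context window point above, the snippet below estimates whether the retrieved chunks, query, and reserved output fit a model's window. All numbers are example values, not limits of any particular model.

```ts
// Rough, illustrative estimate of how much of a model's context window the
// retrieved chunks will consume. All numbers below are example values.
function fitsContextWindow(opts: {
  chunkSize: number;     // tokens per chunk
  topK: number;          // chunks retrieved per query
  queryTokens: number;   // estimated size of the user query
  outputTokens: number;  // tokens reserved for the model's answer
  contextWindow: number; // model's total context window
}): boolean {
  const retrievalTokens = opts.topK * opts.chunkSize;
  const total = retrievalTokens + opts.queryTokens + opts.outputTokens;
  return total <= opts.contextWindow;
}

// Example: 10 chunks of 512 tokens plus query and output fit an
// 8,192-token window (5,120 + 100 + 1,024 = 6,244 tokens).
console.log(
  fitsContextWindow({
    chunkSize: 512,
    topK: 10,
    queryTokens: 100,
    outputTokens: 1024,
    contextWindow: 8192,
  })
); // true
```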