The single biggest determinant of RAG quality — once you have picked a half-decent embedding model and a half-decent generation model — is your chunking strategy. Not your prompt. Not your re-ranker. The chunks. If your chunks are wrong, no amount of prompt engineering or vector-store tuning will save the system, because the model is being asked to answer questions from evidence it cannot see.

This is the strategy guide we wish we had when we started shipping RAG systems in 2023. It collects what works, what does not, and the surprisingly small set of decisions that move the most quality.

1. Why chunking is the bottleneck

RAG works in three steps: split documents into chunks, embed the chunks and index them in a vector store, and at query time pull the top-k chunks and hand them to the model. The model only sees what you retrieve. Everything you do not retrieve is, for that query, invisible. So the chunks are the evidence pool, and the quality of the evidence pool sets the ceiling for the quality of the answer.

The pathological failure mode is silent: the model produces a confident, fluent, plausible answer that is wrong, because the right chunk was sitting two slots below the cutoff. Users do not see the cutoff. They see the wrong answer and lose trust in the system. Fix chunking and most "hallucination" complaints quietly disappear.

2. Fixed-size chunking: the default that is usually wrong

Most quickstarts ship with fixed-size chunking: cut the document into N-token windows with M tokens of overlap. It is the easiest thing to implement, which is why every demo uses it, and it is almost always the wrong production answer. The reason is structural: documents are not uniform. A 512-token window cuts a paragraph in half, separates a heading from the explanation it heads, and slices tables down the middle. The vector index then ranks those mutilated chunks against each other.
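
To make the mechanics concrete, here is a minimal sketch of fixed-size windowing with overlap. The function name `fixed_size_chunks` and the use of whitespace splitting in place of a real tokenizer are assumptions made for illustration, not part of any particular library.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Cut a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
text = "Refunds are issued within 14 days. EU customers have extra rights under local law."
for chunk in fixed_size_chunks(text.split(), size=6, overlap=2):
    print(" ".join(chunk))
```

Even on this two-sentence input, the second window starts in the middle of the first sentence and ends in the middle of the second, which is the structural problem described above in miniature.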

Fixed-size is a reasonable choice for one specific case: unstructured prose with no headings, no tables, no code blocks. Chat transcripts. Free-form notes. Voicemail transcriptions. For anything with structure — technical documentation, knowledge-base articles, policy documents, contracts — you can do better.

3. Structure-aware chunking

The first big upgrade is to chunk along structural boundaries: chapter, section, sub-section, paragraph, list, table. For Markdown the boundaries are explicit. For HTML you walk the DOM. For PDFs you reconstruct the heading hierarchy from font size and weight. For DOCX you read the styles.

The rule is "merge upward to a budget." Start at the smallest semantic unit (paragraph, list item, table row). Concatenate adjacent units of the same parent section until the chunk reaches your token budget (say, 400–800 tokens). Never cross a section boundary mid-chunk. Always prepend the breadcrumb of headings to the chunk text — "Chapter 4 > Refunds > EU customers > ..." — so that the chunk is interpretable on its own, without the surrounding document.
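
The "merge upward to a budget" rule can be sketched for Markdown roughly as follows. This is an illustrative sketch under stated assumptions: headings are ATX-style (`#`) and separated from body text by blank lines, a whitespace word count stands in for a real tokenizer, and the names `chunk_markdown` and `count_tokens` are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def chunk_markdown(doc: str, budget: int = 600) -> list[str]:
    """Merge paragraphs upward to `budget` tokens without crossing a heading."""
    chunks: list[str] = []
    path: list[str] = []    # current heading breadcrumb, outermost first
    buffer: list[str] = []  # paragraphs collected for the current chunk

    def flush() -> None:
        if buffer:
            crumb = " > ".join(path)
            body = "\n\n".join(buffer)
            chunks.append(f"{crumb}\n\n{body}" if crumb else body)
            buffer.clear()

    for block in re.split(r"\n\s*\n", doc.strip()):
        block = block.strip()
        if not block:
            continue
        match = HEADING.match(block)
        if match:
            flush()  # never let a chunk span a section boundary
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
            continue
        if buffer and count_tokens("\n\n".join(buffer + [block])) > budget:
            flush()  # adding this paragraph would exceed the budget
        buffer.append(block)
    flush()
    return chunks
```

A single paragraph larger than the budget is kept whole here rather than split; a production version would need a fallback for that case.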

This single change typically improves retrieval recall by 15–30% on real corpora, with zero infrastructure change.

4. Overlap, properly used

Overlap exists to mitigate the "the right sentence is the first sentence of the next chunk" problem. The amount of overlap matters less than how you use it. Two principles:

  • Overlap should respect sentence boundaries, not token boundaries. An overlap that starts mid-sentence or mid-word adds noise rather than context.
  • For structured documents with breadcrumbs (see section 3) you need almost no overlap, because each chunk already knows its place in the document. We typically use 0–50 tokens of overlap for structured corpora and 50–120 for unstructured ones.
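
One way to keep overlap on sentence boundaries is to carry whole trailing sentences into the next chunk. The sketch below assumes a naive regex sentence splitter and whitespace word counts in place of real tokens; `split_sentences` and `chunk_with_sentence_overlap` are hypothetical names.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_with_sentence_overlap(text: str, budget: int = 120, overlap: int = 30) -> list[str]:
    """Pack whole sentences into chunks and carry trailing sentences forward as overlap."""
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        words_now = sum(len(s.split()) for s in current)
        if current and words_now + len(sentence.split()) > budget:
            chunks.append(" ".join(current))
            # Keep only whole trailing sentences that fit in the overlap allowance.
            carried: list[str] = []
            used = 0
            for prev in reversed(current):
                used += len(prev.split())
                if used > overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Setting `overlap=0` gives plain sentence-aligned chunks, which is usually enough when each chunk already carries a heading breadcrumb.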

5. Tables are not text

Tables are the single most common source of bad RAG answers in technical and financial corpora. The naive approach, serialising the table as Markdown and chunking it like prose, works for tables of fewer than ten rows and fails catastrophically for anything larger, because the row-column relationships are lost as soon as the table breaks across two chunks: the rows in the second chunk no longer have their column headers.

The robust pattern is to extract tables as structured objects (JSON or HTML) and chunk them separately, one chunk per row or per small group of related rows, with the column headers prepended to each chunk. Index both the structured form (for analytical queries) and a one-sentence summary (for navigational queries). At query time, retrieve from both indexes and merge.
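
A minimal sketch of the row-group pattern, assuming the table has already been extracted into a header list and a list of rows. The function name `table_chunks` and the output field names are hypothetical. The one-sentence summary would normally come from a separate summarisation step and is left out here.

```python
import json


def table_chunks(headers, rows, caption="", rows_per_chunk=5):
    """Build one chunk per small group of rows, each carrying the column headers."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        records = [dict(zip(headers, row)) for row in group]
        lines = [caption] if caption else []
        lines.append(" | ".join(headers))  # headers repeated in every chunk
        lines.extend(" | ".join(str(cell) for cell in row) for row in group)
        chunks.append({
            "text": "\n".join(lines),           # the form that gets embedded
            "structured": json.dumps(records),  # kept for analytical queries
            "row_range": (start, start + len(group) - 1),
        })
    return chunks
```

Because every chunk repeats the headers, a retrieved row group stays interpretable even when the rest of the table is not in the context window.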

6. Code, configuration, and command output

Code blocks have the same problem as tables (line-level relationships matter), with the additional twist that the boundary of a "useful chunk" is a function definition, a class, or a configuration section, not a token count. For source code, lean on a real parser (tree-sitter is excellent), chunk by function or class, and include the surrounding module-level docstring as a breadcrumb. For YAML / JSON configuration, chunk by top-level key. For command output, chunk by command + output pair.
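
As an illustration of parser-driven chunking, the sketch below uses Python's standard-library `ast` module rather than tree-sitter, so it only handles Python source. The function name `chunk_python_source` and the output field names are hypothetical; the same idea carries over to a tree-sitter grammar for other languages.

```python
import ast


def chunk_python_source(source: str, module_name: str = "module") -> list[dict]:
    """Return one chunk per top-level function or class, with the module docstring as breadcrumb."""
    tree = ast.parse(source)
    module_doc = (ast.get_docstring(tree) or "").strip()
    summary = module_doc.splitlines()[0] if module_doc else ""
    breadcrumb = f"{module_name}: {summary}" if summary else module_name
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node
            # (decorator lines above the def/class are not included).
            code = ast.get_source_segment(source, node)
            chunks.append({
                "name": node.name,
                "text": f"# {breadcrumb}\n{code}",
                "lines": (node.lineno, node.end_lineno),
            })
    return chunks
```

Module-level statements outside functions and classes (imports, constants) are skipped in this sketch; whether to index them as their own chunk is a judgment call for your corpus.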

7. Metadata is cheap retrieval rocket fuel

Every chunk should carry rich metadata: source document, section path, document type, last-updated date, owning team, language. Most vector stores let you filter at retrieval time on metadata before doing the vector search. That filter is often dramatically more selective than the vector search itself.

Example: a user asks "what's our refund policy in France?" A pure-vector search returns the 10 most semantically similar chunks from across 50,000 documents, whatever country or topic they cover. A filtered search ("country = FR AND topic = refunds", then top 10 by vector) almost always returns a tighter, more relevant set. The metadata filter is doing most of the work; the vector search just ranks within the filtered set.
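
The filter-then-rank flow in that example can be sketched in plain Python. This is an in-memory illustration only: real vector stores expose their own filter syntax, and the chunk layout (`vector` and `meta` keys) and the name `filtered_search` are assumptions made here.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, chunks, filters, k=10):
    """Apply exact-match metadata filters first, then rank the survivors by cosine similarity."""
    candidates = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return candidates[:k]


# Hypothetical usage: only FR refund chunks are ever scored against the query.
# hits = filtered_search(query_vec, chunks, {"country": "FR", "topic": "refunds"}, k=10)
```

Passing an empty `filters` dict degrades to a pure-vector search, which makes it easy to compare the two modes on the same evaluation queries.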

8. Multi-representation indexing

For long documents, index each chunk three ways: the raw text, a one-sentence summary, and a list of hypothetical questions the chunk could answer. Each representation is a separate vector that points back to the same chunk. At query time, search all three and merge by reciprocal rank fusion. The cost is 3× the embedding work, plus the generation calls that produce the summaries and questions; the benefit is a meaningful recall improvement on queries that are phrased very differently from the source material.

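This is the "multi-representation" (or multi-vector) indexing pattern. It is a close cousin of hypothetical document embeddings (HyDE), but works in the opposite direction: HyDE generates a hypothetical answer to the query at query time and embeds that, whereas here the hypothetical questions are generated per chunk at index time. The pattern is most useful for FAQ-style queries against dense technical content.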
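
Reciprocal rank fusion itself is only a few lines. In the sketch below each input is a ranked list of chunk IDs from one representation's index, and a chunk's fused score is the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the commonly used default; the function name is hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        seen = set()
        for rank, chunk_id in enumerate(ranking, start=1):
            if chunk_id in seen:
                continue  # count a chunk once per list, at its best rank
            seen.add(chunk_id)
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


raw_hits = ["c1", "c2", "c3"]       # ranked by raw-text vectors
summary_hits = ["c2", "c1", "c4"]   # ranked by summary vectors
question_hits = ["c2", "c2", "c3"]  # two questions from the same chunk both matched
print(reciprocal_rank_fusion([raw_hits, summary_hits, question_hits]))
# -> ['c2', 'c1', 'c3', 'c4']
```

The de-duplication step matters for the question index in particular, because several hypothetical questions from the same chunk can match a single query.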

9. Re-ranking: cheap, high leverage

Pull the top 30 chunks from the vector store, then re-rank them with a cross-encoder (Cohere Rerank, a BGE reranker, Voyage rerank-2), which scores each query-chunk pair jointly instead of comparing two independently computed vectors. Keep the top 5 for the model context. Re-ranking typically adds 100–300 ms of latency and often improves answer quality more than swapping to a more expensive generation model. It is the lowest-cost, highest-leverage change you can make once basic retrieval is in place.
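
The retrieve-then-rerank plumbing can be sketched with a pluggable scoring function. The scorer used here, `toy_overlap_score`, is a hypothetical stand-in that only counts shared words; it is not a cross-encoder. In a real system you would pass a function that calls your reranking model for each (query, chunk) pair.

```python
from typing import Callable, Sequence


def rerank(query: str, candidates: Sequence[str],
           score: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Re-score each (query, chunk) pair and keep the best `keep` chunks."""
    scored = [(score(query, chunk), index, chunk) for index, chunk in enumerate(candidates)]
    # Highest score first; the original position breaks ties, so the
    # vector-store order is preserved among equally scored chunks.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in scored[:keep]]


def toy_overlap_score(query: str, chunk: str) -> float:
    # Stand-in only: fraction of query words that also appear in the chunk.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0
```

Keeping the scorer as a parameter means the same function can be exercised in tests with a cheap scorer and run in production with a model-backed one.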

10. Measure, do not guess

Every chunking change should be measured against a held-out evaluation set: 50–200 real user queries paired with the chunks (or documents) a domain expert says contain the answer. Measure recall at k (does the right chunk appear in the top-k?), not just final answer quality, because answer quality conflates retrieval with generation. The eval set is the most valuable asset in the RAG project; invest in it before you invest in clever indexing tricks.
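
Recall at k, in the sense used above (does at least one expert-labelled chunk appear in the top k?), can be computed with a short helper. The data layout, a dict of retrieved chunk IDs per query and a dict of gold chunk IDs per query, and the name `recall_at_k` are assumptions for illustration.

```python
def recall_at_k(retrieved: dict[str, list[str]],
                relevant: dict[str, set[str]], k: int = 5) -> float:
    """Fraction of queries whose top-k results contain at least one gold chunk."""
    if not relevant:
        raise ValueError("evaluation set is empty")
    hits = sum(
        1 for query, gold in relevant.items()
        if gold & set(retrieved.get(query, [])[:k])
    )
    return hits / len(relevant)


gold = {"q1": {"c1"}, "q2": {"c7", "c8"}, "q3": {"c9"}, "q4": {"c2"}}
runs = {"q1": ["c1", "c3"], "q2": ["c4", "c8"], "q3": ["c5", "c6", "c9"]}
print(recall_at_k(runs, gold, k=2))  # -> 0.5
print(recall_at_k(runs, gold, k=3))  # -> 0.75
```

Queries with no retrieved results (q4 here) count as misses, so a retrieval outage or an over-strict metadata filter shows up in the score rather than being silently skipped.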

11. Where ConvoSuite fits

ConvoSuite's knowledge-base loader ships with structure-aware chunking, breadcrumb prepending, table extraction, metadata filtering, multi-representation indexing, and re-ranking as defaults. You can tune any of them, but the defaults are the patterns above, tested on production corpora across telco, fintech, and SaaS verticals. If you have a corpus that does not behave with off-the-shelf RAG, the chances are very high that the answer is in one of the ten sections above. We are happy to do a free chunking review and tell you which one.