How to Choose Chunk Size for RAG (With 7 Chunking Strategies & Trade-offs)
What Is Chunking?
When you build a RAG (Retrieval-Augmented Generation) system, you can't feed an entire 200-page document into an LLM at once. So you split it into smaller pieces — chunks — store them in a vector database, and retrieve only the relevant ones at query time.
Chunking is that splitting step. It sounds simple. It isn't.
The way you split text directly affects what your LLM can find and how well it can answer. A bad chunk gives the model half a thought. A good chunk gives it exactly the context it needs.
Why One Strategy Doesn't Fit All
Think of chunking like packing boxes. If you throw everything into boxes of exactly the same size, a short poem and a legal contract end up treated the same way. That's a problem — they have completely different structures and meaning densities.
Different documents, different use cases, and different queries all need different chunking approaches. The seven strategies below each solve a specific problem. Know which one to reach for.
A Word on Tokens vs Characters
Before diving in, one critical clarification: LLMs think in tokens, not characters.
A token is roughly 3–4 characters in English. chunk_size=500 in character terms is approximately 125–170 tokens — not 500 tokens. This gap matters because your embedding model and LLM both have token-based context limits, not character limits.
The code examples below that use character slicing are kept simple for readability. In production, use a tokenizer:
```python
import tiktoken

def token_count(text, model="gpt-3.5-turbo"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def fixed_token_chunk(text, chunk_size_tokens=256, model="gpt-3.5-turbo"):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size_tokens):
        chunk_tokens = tokens[i:i + chunk_size_tokens]
        chunks.append(enc.decode(chunk_tokens))
    return chunks
```
The Seven Chunking Strategies
1. Fixed-Size Chunking
Split text into chunks of exactly N tokens (or characters for quick prototyping), regardless of content.
```python
# Character-based — for illustration only, not token-accurate
def fixed_size_chunk(text, chunk_size=500):
    # NOTE: chunk_size here is in characters (~125–170 tokens)
    # Use the token-aware version above for production
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```
How it works: Count to your limit, cut, repeat. No understanding of content structure.
Where it excels: Uniform data like chat logs or structured records where each line is self-contained.
Where it fails: Narrative text, code, or legal documents where a sentence split destroys meaning.
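To make that failure mode concrete, here's fixed_size_chunk run with a deliberately small chunk_size on a sample sentence (the text and the size are toy values, and the function is repeated so the snippet is self-contained):

```python
def fixed_size_chunk(text, chunk_size=500):
    # Character-based slicing, same as the example above
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "The quarterly report shows revenue grew 12 percent year over year."
chunks = fixed_size_chunk(text, chunk_size=30)
# The word "revenue" is sliced in half at a chunk boundary:
# 'The quarterly report shows rev' | 'enue grew 12 percent year over' | ' year.'
```

A query about revenue growth may now match neither fragment well, because the key term never appears intact in a single chunk.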
2. Sliding Window Chunking
Same as fixed-size, but each chunk overlaps with the previous one by a set number of tokens.
```python
def sliding_window_chunk(text, chunk_size=500, overlap=100):
    # chunk_size and overlap are in characters here
    # Scale to tokens in production (overlap ≈ 10–20% of chunk_size is typical)
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
```
How it works: Move a window across the text, but slide it forward by less than the full chunk size. The repeated section is the overlap — a safety net for information that lands at a boundary.
Where it excels: Long-form articles where a key sentence might land at the edge of a chunk.
Where it fails: Still cuts through sentences — just less often. Not a substitute for structure-aware splitting.
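Running the same toy sentence from the fixed-size example through sliding_window_chunk shows the overlap doing its job (again, the text and sizes are illustrative only):

```python
def sliding_window_chunk(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

text = "The quarterly report shows revenue grew 12 percent year over year."
chunks = sliding_window_chunk(text, chunk_size=30, overlap=10)
# Chunk 1 still ends mid-word ("...shows rev"), but thanks to the overlap
# chunk 2 contains the word "revenue" intact: ' shows revenue grew 12 percent'
```

The boundary damage still happens, but at least one chunk now carries the whole term, which is usually enough for retrieval to find it.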
3. Paragraph Chunking
Split on paragraph boundaries — double newlines or other natural text breaks.
```python
def paragraph_chunk(text, max_tokens=400):
    paragraphs = text.split("\n\n")
    chunks, current, current_len = [], [], 0

    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        # Rough token estimate: characters / 4
        para_len = len(para) // 4
        if current_len + para_len > max_tokens and current:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += para_len

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```
How it works: Respect the author's own structure. Each paragraph becomes a chunk, grouped if they're short.
Where it excels: Editorial content. Each paragraph usually covers one topic, so retrieval maps naturally to user intent.
Where it fails: Academic papers sometimes have 500-word paragraphs. One chunk becomes too large to embed meaningfully, and the signal gets diluted.
4. Sentence Chunking
Split on sentence boundaries, then group N sentences per chunk.
```python
import nltk

def sentence_chunk(text, sentences_per_chunk=3):
    sentences = nltk.sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks
```
Note
nltk.sent_tokenize handles abbreviations and edge cases better than splitting on "." alone. Download the punkt tokenizer first with nltk.download('punkt').

How it works: Detect sentence endings, group N sentences per chunk.
Where it excels: Short, self-contained answers. Great retrieval precision when questions map to a single sentence or two.
Where it fails: Technical documentation where a concept builds across many sentences. Splitting too fine loses the context that makes an answer meaningful.
5. Recursive / Hierarchical Chunking
Try to split by the largest meaningful unit first (section → paragraph → sentence → word), falling back only when a chunk still exceeds your size limit.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters (~250 tokens)
    chunk_overlap=100,  # characters (~25 tokens)
    separators=["\n\n", "\n", ".", "!", "?", " ", ""]
)
chunks = splitter.split_text(text)
```
How it works: Try \n\n first. If the result is still too large, try \n. Still too large? Fall back to sentence-ending punctuation, then spaces, then individual characters. It cascades down the hierarchy until chunks fit your size limit.
Where it excels: Mixed documents — a PDF with headers, paragraphs, bullet points, and code all in one file.
Where it fails: Documents with no natural separators (e.g., one giant unformatted string). The fallback to character splitting will kick in and you're back to fixed-size behavior.
6. Semantic Chunking
Group sentences together as long as they're semantically similar. Start a new chunk when the topic shifts — measured by a drop in embedding similarity between consecutive sentences.
```python
from sentence_transformers import SentenceTransformer
import numpy as np
import nltk

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(a, b):
    # Explicit cosine similarity — dot product alone is NOT cosine similarity
    # unless vectors are already normalized
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_chunk(text, threshold=0.75):
    sentences = nltk.sent_tokenize(text)
    if len(sentences) < 2:
        return [text]

    embeddings = model.encode(sentences, normalize_embeddings=True)
    # With normalize_embeddings=True, dot product == cosine similarity
    # Without it, you MUST divide by the product of norms

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(embeddings[i - 1], embeddings[i])  # valid only because normalized
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])

    chunks.append(" ".join(current))
    return chunks
```
Warning
np.dot(a, b) is only cosine similarity if vectors are unit-normalized. If you use raw embeddings without normalization, always compute: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)). Getting this wrong silently produces invalid similarity scores and unpredictable chunk boundaries.

How it works: Embed each sentence. When two consecutive sentences are semantically far apart (cosine similarity drops below your threshold), cut there.
Where it excels: Finding natural topic transitions that fixed-size or paragraph methods would miss. Chunks are more topically coherent, which improves retrieval precision.
Where it fails: Slow and expensive at scale (you're embedding every sentence). Threshold tuning is non-trivial — too tight and you over-split, too loose and unrelated content merges. Start at 0.7–0.8 and evaluate with your actual queries.
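To see the boundary logic in isolation, without the cost of a real embedding model, here is the same threshold test run on hand-made unit vectors standing in for sentence embeddings (the vectors and threshold are illustrative):

```python
import numpy as np

def boundaries_from_embeddings(embeddings, threshold=0.75):
    # Indices where a new chunk starts: similarity to the
    # previous sentence falls below the threshold
    return [
        i for i in range(1, len(embeddings))
        if np.dot(embeddings[i - 1], embeddings[i]) < threshold
    ]

# Hand-made unit vectors standing in for sentence embeddings:
# "sentences" 0 and 1 are similar, "sentence" 2 shifts topic
e0 = np.array([1.0, 0.0])
e1 = np.array([0.9, np.sqrt(1 - 0.81)])  # cos(e0, e1) = 0.9
e2 = np.array([0.0, 1.0])                # cos(e1, e2) ≈ 0.44
boundaries_from_embeddings([e0, e1, e2], threshold=0.75)  # → [2]
```

Lowering the threshold toward 0.4 would merge all three into one chunk; raising it toward 0.95 would split every pair. That sensitivity is exactly why the threshold needs tuning against real queries.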
7. LLM-Based Chunking
Ask an LLM to read the document and decide where to split it — the way a human editor would.
```python
import anthropic
import json

client = anthropic.Anthropic()

def llm_chunk(text):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Split the following text into logically self-contained chunks.
Each chunk should cover one complete idea or topic.

Return ONLY a valid JSON array. Each element should have:
- "title": a short label describing the chunk's topic
- "content": the chunk text

Text to split:
{text}"""
        }]
    )

    raw = response.content[0].text
    # Strip markdown code fences if the model wraps its output
    clean = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(clean)
```
How it works: The LLM reads the full text, understands it, and makes intelligent split decisions. It can handle unusual structures, implicit topic shifts, and domain-specific logic that no rule can encode.
Where it excels: Unstructured or unusual documents where rules-based methods struggle. The model can understand authorial intent, not just whitespace patterns.
Where it fails: Expensive and slow. At scale, this can cost 10–100x more than other methods. LLM context limits also mean very long documents need a pre-pass to break them into sections first. Not viable for bulk ingestion pipelines.
Warning
LLMs often wrap their output in markdown code fences even when you ask for raw JSON. Always strip the fences before parsing, or your json.loads() will throw.

How to Choose: A Decision Framework
Before picking a strategy, answer these three questions:
1. What's the structure of your document?
Well-structured (headers, paragraphs) → Recursive or Paragraph. Unstructured blob → Semantic or LLM. Uniform/tabular → Fixed-size.
2. How large is your dataset?
Thousands of documents → Recursive or Fixed-size. Dozens of high-value documents → Semantic or LLM.
3. What kind of queries will you serve?
Short factual queries → Smaller chunks (200–300 tokens). Reasoning-heavy queries needing context → Larger chunks (400–600 tokens).
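The three questions above can be read as a tiny routing helper. This is a sketch, not a standard API: the argument labels, strategy names, and token targets are all made-up values for illustration.

```python
def pick_chunking(structure, corpus_size, query_style):
    """Suggest a (strategy, target_tokens) pair from the three framework
    questions. All argument values are illustrative labels for this sketch."""
    # Questions 1 and 2: document structure and corpus size pick the strategy
    if structure == "uniform":
        strategy = "fixed-size"
    elif corpus_size == "large":
        strategy = "recursive"
    elif structure == "unstructured":
        strategy = "semantic"  # or "llm" for the highest-value documents
    else:
        strategy = "recursive"
    # Question 3: query style picks the target chunk size in tokens
    target_tokens = 250 if query_style == "factual" else 500
    return strategy, target_tokens

pick_chunking("structured", "large", "factual")      # ('recursive', 250)
pick_chunking("unstructured", "small", "reasoning")  # ('semantic', 500)
```

In practice the answers interact (a huge corpus of unstructured blobs may force a cheaper strategy than you'd like), so treat the output as a starting point, not a verdict.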
Quick Comparison
| Strategy | Speed | Cost | Semantic Quality | Best For |
|---|---|---|---|---|
| Fixed-size | ⚡ Fastest | 💰 Lowest | ⭐ Basic | Logs, prototypes |
| Sliding window | ⚡ Fast | 💰 Low | ⭐⭐ Better | Boundary-sensitive text |
| Paragraph | ⚡ Fast | 💰 Low | ⭐⭐ Good | Articles, blogs |
| Sentence | ⚡ Fast | 💰 Low | ⭐⭐ Good | FAQs, short answers |
| Recursive | 🔄 Medium | 💰 Low | ⭐⭐⭐ Great | Mixed documents (default) |
| Semantic | 🐢 Slow | 💰 Medium | ⭐⭐⭐ Great | Multi-topic long docs |
| LLM-based | 🐢 Slowest | 💰 High | ⭐⭐⭐⭐ Best | High-value unstructured docs |
Takeaways
- Chunking is a core RAG design decision — bad chunks mean bad retrieval, regardless of how good your LLM is.
- Characters ≠ tokens. Always think in tokens when sizing chunks. 500 characters ≈ 125 tokens.
- Start with recursive chunking at 200–400 tokens with 10–15% overlap. It handles most document types well.
- Use semantic chunking when documents shift topics without structural markers like headers.
- Reserve LLM chunking for high-value documents — the quality jump is real, but so is the cost.
- When computing similarity between embeddings, use cosine similarity, not raw dot product — unless your vectors are explicitly unit-normalized.
- Retrieval quality is your real metric. After chunking, test with real queries and measure whether the right chunks come back. Chunk size tuning without this feedback loop is guesswork.
- Smaller chunks improve retrieval precision (finding the right chunk). Larger chunks improve answer quality (giving the model enough context). You're always balancing both.
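The last two takeaways suggest a minimal feedback loop: for each test query with a known relevant chunk, check whether that chunk lands in the top-k retrieved results. The retriever below is a hard-coded stand-in with made-up chunk ids; swap in your actual vector search.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """eval_set: list of (query, relevant_chunk_id) pairs.
    retrieve: maps a query string to a ranked list of chunk ids.
    Returns the fraction of queries whose relevant chunk is in the top k."""
    hits = sum(
        1 for query, relevant_id in eval_set
        if relevant_id in retrieve(query)[:k]
    )
    return hits / len(eval_set)

# Hypothetical fixed rankings standing in for a real vector index
fake_rankings = {
    "what was q3 revenue": ["c7", "c2", "c9"],
    "refund policy": ["c4", "c1", "c3"],
}
eval_set = [("what was q3 revenue", "c2"), ("refund policy", "c8")]
score = hit_rate_at_k(eval_set, lambda q: fake_rankings[q], k=3)
# "c2" is ranked 2nd for the first query; "c8" never appears → hit rate 0.5
```

Re-run this after every chunking change. If hit rate drops when you shrink chunks, you've crossed from "more precise" into "too fragmented" for your query mix.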