The Thing RAG Never Solved
I've built enough RAG pipelines to know where they quietly fail. Not on demos — on the second or third month of production, when users start asking questions that require connecting three documents you ingested six weeks apart.
RAG is stateless. Every query is day one. The LLM retrieves chunks, synthesizes an answer, and throws away everything it just figured out. Ask the same question tomorrow — same retrieval, same re-synthesis, same ephemeral answer.
That's not a retrieval problem. That's a knowledge architecture problem.
What Karpathy Posted
On April 3, 2026, Karpathy posted on X:
"LLM Knowledge Bases — Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge."
The tweet went viral. He followed it the next day with a GitHub Gist — a full idea file laying out the system architecture and philosophy.
The premise, stated simply:
Don't retrieve at query time. Compile at ingestion time.
Instead of dumping raw documents into a vector store, you use an LLM to read every incoming source and write a wiki — structured markdown with concept pages, cross-references, summaries, and backlinks. That wiki is the thing you query against. It persists. It compounds.
What Himanshu's Diagram Actually Added
Karpathy described the idea in prose. Himanshu drew the full system.
His diagram made explicit what the tweet left implicit — specifically three things most commentary missed:
The output layer. Karpathy talked about the wiki itself. Himanshu's diagram showed that the wiki is an intermediate artifact, not the end product. From the wiki, the LLM can generate outputs: structured Markdown reports, slides, and charts via Matplotlib. The wiki becomes a source of derived knowledge products, not just a place to ask questions.
The feedback loop. The diagram showed a "filed back to wiki" arrow connecting Q&A output back into the wiki. When you ask a question and get a good answer, that answer becomes a new wiki page. The knowledge base grows from two directions: source ingestion and query exploration. This is the part that makes it genuinely compound — not just inputs accumulating, but your own questions becoming knowledge.
The future directions. The bottom of the diagram shows two end goals Karpathy mentioned but didn't elaborate on: synthetic data gen (fine-tune a small model on the clean wiki) and product vision (beyond hacky scripts, into a real system). That arc — from manual ingestion → structured wiki → training data → custom model — is the real long-term payoff, and Himanshu made it visible as a roadmap rather than a footnote.
The Architecture
```
raw/       ← immutable source files you add (articles, papers, repos, images)
wiki/      ← LLM-owned markdown (concepts, summaries, backlinks)
schema.md  ← configuration that tells the LLM how to maintain the wiki
index.md   ← table of contents (~100 articles, one-line summaries)
log.md     ← append-only record of every ingest/lint/query event
```
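Setting this layout up by hand is trivial, but for completeness, here's a minimal scaffold sketch. The starter headers are assumptions, not prescribed by the gist; index.md is placed inside wiki/ to match the code later in this post.

```python
from pathlib import Path

def scaffold(root: str) -> None:
    """Create the empty wiki layout: raw/, wiki/, schema.md, index.md, log.md."""
    base = Path(root)
    (base / "raw").mkdir(parents=True, exist_ok=True)   # immutable sources
    (base / "wiki").mkdir(exist_ok=True)                # LLM-owned pages
    for name, header in [
        ("schema.md", "# Wiki Schema\n"),
        ("wiki/index.md", "# Index\n"),
        ("log.md", "# Log\n"),
    ]:
        path = base / name
        if not path.exists():  # never clobber an existing wiki
            path.write_text(header)

scaffold("my-kb")
```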
The LLM runs four operations against this structure:
Compile — new source arrives in raw/, LLM reads it and writes or updates 10–15 wiki pages in one pass. Done once at ingestion time. Not repeated at query time.
Q&A — user asks a question, LLM reads index.md first (to find relevant pages), loads those specific pages, synthesizes an answer with wiki-link citations. If the answer is worth keeping, it gets filed back as a new page.
Lint — periodic health check across the full wiki. Finds contradictions between pages, stale claims, orphaned concepts, missing links. Runs weekly or after significant ingestion.
Index — regenerate index.md so its summaries and backlinks stay current as pages are added or updated.
What Backlinks, Index, and Lint Actually Mean
These three words get thrown around but they're doing specific jobs. Worth understanding concretely.
Backlinks — imagine a wiki page on Transformer Architecture. Inside it, you reference Attention Mechanism and Positional Encoding. Those are forward links. A backlink is the reverse: the Attention Mechanism page automatically knows it was mentioned by Transformer Architecture, BERT, and GPT-2. When you open the Attention page, you see the full web of pages that depend on it. That's the knowledge graph. In RAG, every document is an island — no page knows it exists in relation to anything else.
Index — once your wiki grows past ~100 pages, the LLM can't hold it all in context. The index is a lightweight table of contents: page title, one-line summary, category. When you ask a question, the LLM reads the index first to decide which wiki pages are relevant, then loads only those. Same reason a textbook has a table of contents before 600 pages — you don't re-read everything to find one concept.
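To make that concrete, here's what an index.md might look like under the shape described above (page title, one-line summary, category). The entries are hypothetical:

```markdown
# Index

## Architecture
- [[transformer-architecture]] — encoder/decoder stack built on self-attention. (concept)
- [[attention-mechanism]] — weighted lookup over token representations. (concept)

## Sources
- [[attention-is-all-you-need-summary]] — summary of Vaswani et al. 2017. (source-summary)
```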
Lint — here's a concrete example. In March you ingested a paper saying "GPT-4 context window is 8K tokens." In April you added OpenAI's updated docs saying it's 128K. RAG doesn't care — both chunks coexist, one will win retrieval based on score. A lint pass reads the full wiki, flags that contradiction, and either updates the stale page or surfaces it for your review. It also catches orphaned pages (a concept referenced everywhere but no page actually exists for it) and dead ends (pages nothing links to, probably shouldn't be in the wiki). This is what makes the wiki trustworthy over time rather than just large.
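Backlinks, orphans, and dead references can all be derived mechanically from [[wiki-link]] syntax — no LLM needed for this part. A minimal sketch, with hypothetical page names:

```python
import re
from collections import defaultdict

LINK = re.compile(r"\[\[([^\]]+)\]\]")

def analyze(pages: dict[str, str]) -> tuple[dict, list, list]:
    """pages maps page name -> markdown body.
    Returns (backlinks, orphans, dead_refs)."""
    backlinks = defaultdict(set)
    dead_refs = []
    for page, body in pages.items():
        for target in LINK.findall(body):
            if target in pages:
                backlinks[target].add(page)       # the reverse of a forward link
            else:
                dead_refs.append((page, target))  # link to a page that doesn't exist
    orphans = [p for p in pages if p not in backlinks]  # nothing links here
    return dict(backlinks), orphans, dead_refs

pages = {
    "transformer-architecture": "Built on [[attention-mechanism]] and [[positional-encoding]].",
    "attention-mechanism": "Weighted lookup over tokens.",
    "positional-encoding": "Injects order information.",
    "stale-notes": "See [[nonexistent-page]].",
}
backlinks, orphans, dead = analyze(pages)
# backlinks["attention-mechanism"] == {"transformer-architecture"}
# dead == [("stale-notes", "nonexistent-page")]
```

Contradiction detection is the one lint check that genuinely needs an LLM; the structural checks above are plain graph bookkeeping.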
The Schema File — the Most Important Piece
schema.md (or CLAUDE.md / AGENTS.md depending on your agent) is what makes this system coherent across sessions. It's the persistent configuration that tells the LLM exactly how to behave when it touches the wiki.
A minimal skeleton:
```markdown
# Wiki Schema

## Directory Structure
- `raw/` — immutable. Never modify source files.
- `wiki/` — you own this entirely. Create, update, delete pages as needed.
- `raw/assets/` — local copies of images, referenced in wiki via relative paths.

## Page Format
Every wiki page must have this frontmatter:

---
title: <page title>
type: concept | entity | source-summary | comparison
sources: [list of raw/ files this page draws from]
related: [list of wiki pages linked from this page]
created: YYYY-MM-DD
updated: YYYY-MM-DD
confidence: high | medium | low
---

Body: TLDR (2-3 sentences) → main content → counterarguments or caveats

## Ingest Workflow
When a new file arrives in raw/:
1. Read it fully before writing anything.
2. Identify 3–8 key concepts. For each: find existing wiki page or create one.
3. Update cross-references — if you mention concept X, link to wiki/X.md.
4. Add backlinks — update the related: field on every page you reference.
5. Append to log.md: `## [YYYY-MM-DD] ingest | <source title>`

## Query Workflow
When asked a question:
1. Read index.md first. Identify relevant pages.
2. Load those pages. Synthesize answer with [[wiki-link]] citations.
3. If the answer is worth keeping, ask: "File this as a wiki page?"

## Lint Workflow
Scan the full wiki and report:
- Contradictions between pages (same claim, different values)
- Orphaned pages (no incoming backlinks)
- Dead references (links to pages that don't exist)
- Stale dates (confidence: high on pages older than 6 months)
```
The Query → Wiki Feedback Loop
This is the part that makes it genuinely compound, and it's easy to miss.
Standard flow: you ask a question, get an answer, close the chat. Tomorrow that synthesis is gone.
LLM Wiki flow: you ask a question, get an answer with citations like [[Attention Mechanism]] and [[Scaling Laws]]. The LLM then asks: "This answer synthesizes three sources in a way that isn't currently captured anywhere in the wiki. Should I file it as a new page?" You say yes, it writes wiki/attention-vs-scaling-tradeoffs.md, links it to the related pages, updates the index.
Now that question you asked is permanently part of the knowledge base. The next time someone asks something adjacent, the LLM finds that synthesis directly — it doesn't have to re-derive it.
Your curiosity compounds. That's the actual value, and it doesn't happen in RAG at all.
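The file-back step itself is small. Here's a sketch under the page format from the schema skeleton above — the function name, slug convention, and frontmatter fields are my assumptions, not code from the gist:

```python
from datetime import date
from pathlib import Path

def file_back_answer(answer: str, title: str, related: list[str], wiki_root: str) -> Path:
    """Persist a query answer as a new wiki page and log the event."""
    slug = title.lower().replace(" ", "-")
    page = Path(wiki_root) / "wiki" / f"{slug}.md"
    today = date.today().isoformat()
    frontmatter = (
        "---\n"
        f"title: {title}\n"
        "type: concept\n"
        f"related: [{', '.join(related)}]\n"
        f"created: {today}\nupdated: {today}\n"
        "confidence: medium\n"  # query syntheses start at medium until reviewed
        "---\n\n"
    )
    page.parent.mkdir(parents=True, exist_ok=True)
    page.write_text(frontmatter + answer)
    # append-only audit trail, per the log.md convention
    with open(Path(wiki_root) / "log.md", "a") as log:
        log.write(f"## [{today}] file-back | {title}\n")
    return page
```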
RAG vs LLM Wiki — the honest comparison
The difference isn't retrieval quality. It's whether knowledge accumulates.
```
RAG:      ingest → embed → store → [query: retrieve → synthesize → discard]
LLM Wiki: ingest → compile → store → [query: read wiki → synthesize → file back]
```
With RAG, three papers on the same topic stay three papers. With LLM Wiki, after ingestion they're one concept page with contradictions flagged and relationships explicit.
"The tedious part of maintaining a knowledge base is not the reading or the thinking — it's the bookkeeping." — Karpathy's Gist
LLMs are infinitely patient bookkeepers.
The Code
```python
import anthropic
from pathlib import Path

client = anthropic.Anthropic()

def load_schema(wiki_root: str) -> str:
    schema_path = Path(wiki_root) / "schema.md"
    return schema_path.read_text() if schema_path.exists() else ""

def load_wiki_index(wiki_root: str) -> str:
    index_path = Path(wiki_root) / "wiki" / "index.md"
    return index_path.read_text() if index_path.exists() else ""

def compile_source(raw_file: str, wiki_root: str) -> None:
    """Ingest a new raw source into the wiki."""
    schema = load_schema(wiki_root)
    source_text = Path(raw_file).read_text()
    existing_index = load_wiki_index(wiki_root)

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8096,
        system=f"{schema}\n\nCurrent wiki index:\n{existing_index}",
        messages=[{
            "role": "user",
            "content": f"""New source to compile into the wiki:

{source_text}

Following the schema:
1. Identify key concepts — check the index for existing pages to update vs new ones to create.
2. Write each page in the required format (frontmatter + TLDR + body + caveats).
3. Add backlinks — update related: fields on pages you reference.
4. Append to log.md.

Output format: for each file, write `=== wiki/filename.md ===` then the full content."""
        }]
    )

    # Parse and write pages from the response
    _write_pages_from_response(response.content[0].text, wiki_root)


def lint_wiki(wiki_root: str) -> str:
    """Run a health check across the full wiki."""
    schema = load_schema(wiki_root)
    wiki_dir = Path(wiki_root) / "wiki"

    all_pages = {}
    for md_file in wiki_dir.glob("**/*.md"):
        all_pages[str(md_file.relative_to(wiki_root))] = md_file.read_text()

    pages_combined = "\n\n---\n\n".join(
        f"FILE: {path}\n{content}"
        for path, content in all_pages.items()
    )

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        system=schema,
        messages=[{
            "role": "user",
            "content": f"""Run a lint pass on the full wiki. Find:
- Contradictions (same claim, conflicting values across pages)
- Orphaned pages (no incoming backlinks from other pages)
- Dead references (links to pages that don't exist)
- Stale high-confidence claims on pages older than 6 months

Wiki contents:
{pages_combined}

Output a structured report with file:line references for each issue."""
        }]
    )

    report = response.content[0].text
    lint_report_path = Path(wiki_root) / "lint-report.md"
    lint_report_path.write_text(report)
    return report


def query_wiki(question: str, wiki_root: str, file_back: bool = False) -> str:
    """Query the wiki. Optionally file the answer back as a new page."""
    schema = load_schema(wiki_root)
    index = load_wiki_index(wiki_root)

    # Step 1: identify relevant pages from the index
    routing = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Given this wiki index:\n{index}\n\nQuestion: {question}\n\nList the wiki page filenames most relevant to answering this. Filenames only."
        }]
    )

    relevant_files = _parse_filenames(routing.content[0].text)

    # Step 2: load those pages and answer
    wiki_dir = Path(wiki_root) / "wiki"
    loaded_pages = "\n\n---\n\n".join(
        (wiki_dir / f).read_text()
        for f in relevant_files
        if (wiki_dir / f).exists()
    )

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        system=f"{schema}\n\nWiki pages:\n{loaded_pages}",
        messages=[{
            "role": "user",
            "content": f"{question}\n\nUse [[wiki-link]] citations. If this answer synthesizes something not currently captured in the wiki, flag it for filing."
        }]
    )

    answer = response.content[0].text

    if file_back:
        # compile_source_from_text — same as compile_source but takes a string
        # instead of a file path. Writes the answer to raw/ then runs compile.
        compile_source_from_text(answer, wiki_root, source_type="query-result")

    return answer
```
Note
_write_pages_from_response and _parse_filenames are simple string-parsing helpers — split on the === wiki/filename.md === delimiter and write each chunk to disk. The schema is what makes the output structured enough to parse reliably.
About qmd
When the wiki grows past ~100 pages, reading the full index every query starts to strain context. qmd (github.com/tobi/qmd) is a local CLI search engine built specifically for markdown knowledge bases. It runs BM25 full-text search, vector semantic search, and LLM re-ranking — all on-device via node-llama-cpp with GGUF models. No API calls.
```shell
# Install
npm install -g qmd

# Index your wiki
qmd index ./wiki

# Hybrid search (BM25 + vector + rerank)
qmd query "how does attention scale with sequence length"
```
It also exposes an MCP server, so your LLM agent can call it directly rather than you writing the routing step manually. At small scale you don't need it. Past 200 pages, you probably do.
Where This Actually Applies
This isn't a replacement for RAG at enterprise scale. Karpathy explicitly scopes it: a bounded, curated corpus — ~100 articles, a research domain you're actively building knowledge in.
Where it fits well:
- Personal research system (papers, blog posts, notes)
- Domain-specific internal knowledge base with slow-moving documents
- Any system where synthesis quality matters more than ingestion volume
Where RAG still wins:
- Large, fast-changing corpora
- Real-time document ingestion at scale
- Multi-tenant systems where per-user wikis aren't practical
What I'm Actually Taking From This
The part that landed hardest isn't the wiki format. It's the lint pass.
In every RAG pipeline I've built, there's dead knowledge — documents contradicting each other, concepts that got redefined as the domain evolved, relationships that exist in the data but never surface at query time. We don't clean it because it's tedious and there's no mechanism for it.
A periodic LLM pass that reads the whole knowledge base and surfaces inconsistencies — that's something worth bolting onto existing RAG pipelines even if you never adopt the full wiki pattern.
The bigger shift is treating the LLM as a knowledge author, not just a retriever. RAG asks the LLM to be a fast reader. LLM Wiki asks it to be a librarian. Those are different jobs, and the second one is actually closer to what LLMs are good at.
Where It Goes Next
Karpathy's gist ends with two future directions shown explicitly in Himanshu's diagram:
Synthetic data gen — once the wiki is clean and structured, fine-tune a smaller model on it. The curated knowledge base becomes training data. The wiki isn't just a retrieval artifact — it's a path to a private, domain-specific model without the overhead of a full fine-tuning pipeline on raw documents.
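One plausible reading of "synthetic data gen" is converting wiki pages into instruction-tuning pairs. A sketch — the prompt template and JSONL field names are assumptions, and a real pipeline would generate multiple varied pairs per page:

```python
import json
from pathlib import Path

def wiki_to_jsonl(wiki_dir: str, out_file: str) -> int:
    """Emit one {prompt, completion} training pair per wiki page."""
    count = 0
    with open(out_file, "w") as out:
        for page in sorted(Path(wiki_dir).glob("**/*.md")):
            body = page.read_text()
            # Skip frontmatter if present; train on the page body only.
            if body.startswith("---"):
                body = body.split("---", 2)[-1]
            pair = {
                "prompt": f"Explain: {page.stem.replace('-', ' ')}",
                "completion": body.strip(),
            }
            out.write(json.dumps(pair) + "\n")
            count += 1
    return count
```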
Beyond hacky scripts — proper tooling around the ingest/compile/lint loop. The open-source ecosystem is already moving fast here: MCP server implementations, Claude Code agent skills, and CLI tools appeared within days of the gist. The pattern is settling; the tooling is catching up.
The idea that a knowledge base should be continuously maintained by the same LLM that uses it — that feels like the direction everything is heading. RAG was the right answer when context windows were small and LLMs were slow. Both of those constraints are eroding. The architecture should follow.