---
title: "Building Karpathy's Knowledge Base Without Embeddings"
date: "2026-04-03"
excerpt: "How to build an LLM knowledge base without embeddings or a vector database. I replaced the standard RAG pipeline — chunking, embedding, similarity search — with an LLM that reads a markdown summary table and selects which documents to read in full. Working code included."
template: "technical"
category: "AI Engineering"
---
I've re-read the same 300-page lease four times this year. Not because I forgot what's in it. Because my tools forgot.

Andrej Karpathy [posted](https://x.com/karpathy/status/2039805659525644595) about the same frustration:

![Andrej Karpathy's tweet about building LLM-powered knowledge bases by dumping documents into a folder and compiling them into a wiki](/images/karpathy-tweet-cropped.png)

Dump documents into a folder. An LLM compiles them into a wiki. Ask questions. He called it "a hacky collection of scripts" and said there's room for "an incredible new product."

Then he published [the full pattern as a gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) — architecture, operations, indexing. The key line:

> *"This works surprisingly well at moderate scale (~100 sources, ~hundreds of pages) and avoids the need for embedding-based RAG infrastructure."*

[@jumperz](https://x.com/jumperz/status/2039826228224430323) drew the best summary:

![Architecture diagram of Karpathy's LLM Knowledge Base showing the flow from sources to raw files to wiki to Q&A to output](/images/Modules.jpg)
*Sources → raw/ → Wiki → Q&A → Output. The LLM writes everything. You just steer. Credit: [@jumperz](https://x.com/jumperz/status/2039826228224430323)*

I looked at the replies. Everyone's reaching for vector databases, embedding pipelines, chunking strategies.

Ever since I wrote [Why I Stopped Using RAG](/articles/why-i-stopped-using-rag-for-document-processing), the most common question I get is: "How do you actually do retrieval without embeddings?" People ask on LinkedIn, in DMs, in comments. I keep answering the same way — LLM reads a summary table, picks files, reads them in full.

So instead of answering one more time, I'm building it in public. Working code. A series where each post ships something you can run. If it works at your scale, use it. If it doesn't, you'll know exactly where to add embeddings back.

---

## Why embeddings and chunking fail for knowledge bases

Ask "how do I build Q&A over documents" and you'll get this:

```
Docs → Chunk → Embed → Vector DB
  → Query → Search → Top-K → LLM
```

Every arrow is a decision. Chunk size? Overlap? Which embedding model? Which vector DB? Similarity threshold? How many chunks to retrieve? Each decision affects answer quality. Each component adds infrastructure you have to run, tune, and debug.

And here's what nobody talks about: **chunking destroys context.** You take a 30-page financial report, chop it into 500-token pieces, embed each piece, and hope that when someone asks a question, cosine similarity finds the right pieces. It often doesn't. The answer spans two chunks. The context is in the header three pages up. The table caption is in one chunk and the table is in another.

I [wrote about this in detail](/articles/why-i-stopped-using-rag-for-document-processing) — the retriever and the reasoner are different systems, and the retriever guesses wrong on anything that requires cross-referencing.

## How retrieval works without embeddings

```
Docs → Parse → Summarize → index.md
  → Query → LLM picks files → Reads in full
```

`index.md` is a table. Each row has a filename, a one-line summary, and topic keywords:

```markdown
| Source | Type | Summary | Topics |
|--------|------|---------|--------|
| q3-results.md | PDF | Q3 revenue $2.3M, flat Q4 guidance | finance, quarterly |
| pipeline.md | Excel | 142 deals, $4.1M weighted pipeline | sales, pipeline |
| board-deck.md | PPTX | Strategy shift to enterprise | strategy, hiring |
```

When you ask "what changed in Q4 guidance?", the agent reads this table — a thousand tokens or so for 50 sources — and picks which files to read. Then it reads those files **in full**. Not chunks. The whole document.
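That selection step is small enough to sketch. A minimal TypeScript version — `callModel` is a stand-in for whatever LLM client you use, and none of these names come from the actual llm-kb code:

```typescript
interface IndexRow {
  source: string;
  type: string;
  summary: string;
  topics: string;
}

// Parse the markdown table in index.md into rows.
// (A sketch: assumes no cell contains "|" or "---".)
function parseIndex(markdown: string): IndexRow[] {
  return markdown
    .split("\n")
    .filter((line) => line.startsWith("|") && !line.includes("---"))
    .slice(1) // drop the header row
    .map((line) => {
      const [source, type, summary, topics] = line
        .split("|")
        .slice(1, -1)
        .map((cell) => cell.trim());
      return { source, type, summary, topics };
    });
}

// Show the model the summaries, let it pick filenames, keep only
// filenames that actually exist in the index.
async function pickFiles(
  question: string,
  indexMd: string,
  callModel: (prompt: string) => Promise<string>, // your LLM client
): Promise<string[]> {
  const rows = parseIndex(indexMd);
  const prompt =
    `Question: ${question}\n\nSources:\n` +
    rows.map((r) => `${r.source}: ${r.summary} (${r.topics})`).join("\n") +
    `\n\nReply with the filenames worth reading, one per line.`;
  const reply = await callModel(prompt);
  const known = new Set(rows.map((r) => r.source));
  return reply
    .split("\n")
    .map((l) => l.trim())
    .filter((f) => known.has(f));
}
```

The files it returns get read whole and dropped into the answering prompt. That's the entire retrieval layer.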

Think of it like a librarian. If the library has 200 books, a good librarian reads the card catalog and knows exactly which ones matter. She doesn't need a search engine. She reads the summaries, understands your question, and makes a judgment call.

That's what the LLM does. And at this scale, its judgment is better than cosine similarity — because the LLM actually understands what you're asking. An embedding model just compares vectors.

| Factor | Embeddings + Vector DB | LLM reads index.md |
|--------|----------------------|---------------------|
| Retrieval | Cosine similarity on chunks | LLM judgment on summaries |
| What the model sees | Top-K fragments | Full documents |
| Cross-referencing | Broken by chunking | Intact — reads whole file |
| Infrastructure | Embedding model + vector DB + chunking pipeline | Markdown files in a folder |
| Scale | Millions of documents | 50-500 documents |
| Cost per query | Low (one embed + one LLM call) | Higher (index + 3-7 full files) |

**When does this break?** Past ~500 documents. The index gets too long, file selection gets noisy, token costs spike. If you need to find one sentence across 5,000 documents, vector search wins. I'm making that tradeoff deliberately.

The right infrastructure is the least infrastructure that solves the problem. If I outgrow this, I add retrieval. The agent already reads files through tools — swapping "read file" for "search index" changes one component, not the architecture.
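One way to picture that swap point, sketched with illustrative names (not the real llm-kb API): the agent codes against one contract, and only the implementation behind it changes.

```typescript
interface Retriever {
  // Returns full document texts relevant to the question.
  retrieve(question: string): Promise<string[]>;
}

// Today: the LLM picks filenames from index.md; we read them whole.
class IndexRetriever implements Retriever {
  constructor(
    private pickFiles: (question: string) => Promise<string[]>, // LLM call
    private readFile: (name: string) => Promise<string>, // disk read
  ) {}
  async retrieve(question: string): Promise<string[]> {
    const files = await this.pickFiles(question);
    return Promise.all(files.map((f) => this.readFile(f)));
  }
}

// Past ~500 documents: same contract, vector search behind it.
// class VectorRetriever implements Retriever { ... }
```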

## Architecture: file watcher, agent sessions, eval loop

Three zones. [Part 2](/articles/building-karpathy-knowledge-base-part-2) shows how Pi SDK sessions make this work with almost no code:

```
┌──────────────────────────────────────┐
│                                      │
│  FILE WATCHER  (built ✓)             │
│  Parse PDFs → Build index.md         │
│  Watch for new files → re-index      │
│                                      │
│  AGENT SESSIONS  (Part 3-4)          │
│  Query: read-only → answer           │
│  Research: read+write → save to wiki │
│                                      │
│  EVAL LOOP  (Part 5)                 │
│  Trace queries → check citations     │
│  Flag inconsistencies                │
│                                      │
└──────────────────────────────────────┘
```

The eval loop matters because of a question nobody in the Karpathy thread is asking: **how do you know the wiki is right?** An LLM compiles 100 articles into summaries — did it get the facts right? Every query gets traced. The eval agent checks citations against source files. I wrote about [why bounding boxes matter for verification](/articles/parsing-pdfs-with-bounding-boxes) — per-word coordinates let you confirm that a cited page actually says what the LLM claims.
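The simplest of those checks fits in a few lines. A sketch — real verification uses the bounding-box coordinates, but a normalized substring match already catches quotes the source never said (function names are mine, not llm-kb's):

```typescript
// Collapse case and whitespace so line breaks in the parsed
// PDF text don't cause false negatives.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Does the cited page actually contain the quoted text?
function citationSupported(quote: string, pageText: string): boolean {
  return normalize(pageText).includes(normalize(quote));
}
```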

## Try it: one command to start

The ingest pipeline is working today. No vector database to set up. No embedding model to choose. No chunking strategy to debate:

```bash
npx llm-kb run ./my-documents
```

```
llm-kb v0.0.1

Scanning ./my-documents...
  Found 9 files (9 PDF)
  9 parsed

  Building index...
  Index built: .llm-kb/wiki/index.md

  Watching for new files... (Ctrl+C to stop)
```

PDFs become markdown + bounding boxes. The index builds. The watcher starts. Drop a PDF in while it's running — parsed and indexed automatically. Original files never touched.
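The watcher itself can be a thin wrapper over Node's `fs.watch`. A sketch with the parse/re-index step abstracted into a callback — these names are illustrative, not the published API:

```typescript
import { watch } from "node:fs";
import { extname } from "node:path";

// fs.watch reports a new file as a "rename" event with its name.
// (A robust version would also confirm the file still exists,
// since deletes fire "rename" too.)
function isNewPdf(event: string, filename: string | null): boolean {
  return event === "rename" && filename !== null && extname(filename) === ".pdf";
}

function watchDocs(dir: string, onPdf: (filename: string) => Promise<void>) {
  return watch(dir, (event, filename) => {
    if (isNewPdf(event, filename)) {
      void onPdf(filename as string); // parse → markdown + bboxes, rebuild index.md
    }
  });
}
```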

## The series

Each post ships working code:

| Part | What |
|---|---|
| **1** | No embeddings — why it works (this post) |
| **2** | [Pi SDK sessions as RAG](/articles/building-karpathy-knowledge-base-part-2) |
| **3** | [Every query makes it smarter — the compounding loop](/articles/building-karpathy-knowledge-base-part-3) |
| **4** | Web UI |
| **5** | Eval — traces, checks, reports |
| **6** | Docker |
| **7** | Citations — [bounding boxes](/articles/parsing-pdfs-with-bounding-boxes) → highlight source text |

[GitHub →](https://github.com/satish860/llm-kb)

---

**Related:** [Why I Stopped Using RAG for Document Processing](/articles/why-i-stopped-using-rag-for-document-processing) · [PDF Parsing With Bounding Boxes](/articles/parsing-pdfs-with-bounding-boxes)

*[Pi SDK](https://github.com/mariozechner/pi) · [LiteParse](https://github.com/run-llama/liteparse) · [Karpathy's post](https://x.com/karpathy/status/2039805659525644595)*