A fully local RAG system answers questions from your own documents with no cloud and no API keys. You ingest a corpus, split it into chunks, and embed each chunk into a vector with nomic-embed-text running on Ollama. The vectors go into ChromaDB, a local file-based vector store. At query time you embed the question, retrieve the nearest chunks by cosine similarity, build a prompt that instructs a local LLM to answer only from those chunks, and return a grounded, cited answer. Everything runs on one machine, offline, at zero marginal cost.
The 60-Second TL;DR
5 Strategic Takeaways
- RAG is retrieval plus grounding, not magic. The model never learns your data. You fetch the relevant passages at query time and hand them over as context. Quality lives in retrieval, not in the model.
- Local is a feature, not a compromise. No API keys, no per-token billing, no connectivity dependency, and complete data sovereignty. For sensitive or regulated material, running offline is the requirement, not a nice-to-have.
- Two defaults will silently wreck your results. ChromaDB ships with L2 distance when you almost always want cosine, and nomic-embed-text may truncate long chunks at the default context window. Both fail without errors. Set them deliberately.
- Chunking and top-k are the real tuning surface. The model and the database barely change between projects. Chunk size, overlap, and retrieval depth are where you actually move answer quality.
- Grounding is your anti-hallucination lever. A prompt that instructs the model to answer only from retrieved context, cite its sources, and refuse when the context is insufficient does more for reliability than any model upgrade.
Why This Class Exists
Most RAG tutorials reach for a cloud embedding API and a hosted vector database, then stop at a demo that falls over the moment you point it at real documents. This class builds the opposite: a complete pipeline that runs entirely on hardware you own, with every default that bites you called out and fixed, and every line of code runnable.
Nothing leaves the machine
Embeddings, vectors, queries, and generation all run locally on Ollama and ChromaDB. No cloud, no keys, no data egress. The same property that makes it private makes it free and offline-capable.
Built on your own documents
The reference build ingests a folder of text and markdown files and answers questions about them with citations. Point it at your own docs and it works unchanged — it is a tool, not a toy.
The failures nobody warns you about
The L2-versus-cosine default and the embedding truncation limit both fail silently and quietly ruin retrieval. This class surfaces them up front with the exact fix, so your first build works.
No framework to hide the machinery
Built on the raw Ollama client and ChromaDB, not LangChain or LlamaIndex. You see every step, so you understand exactly what a framework would be doing for you — and can debug it when it breaks.
What Is Retrieval-Augmented Generation?
A language model knows only what it learned in training. Ask it about your own documents — your product catalog, your internal notes, last quarter's report — and it cannot answer, because that text was never in its training data. Retrieval-augmented generation fixes this without retraining anything. At query time, the system searches your documents for the passages most relevant to the question and hands them to the model as context. The model then answers from that context.
The key reframing: the model never learns your data. It stays frozen. Your data lives in a searchable store, and you inject the relevant slice into the prompt for each question. This is why RAG is cheap, updatable, and auditable — you can change a document and the next query reflects it instantly, and you can see exactly which passages produced an answer.
RAG is an open-book exam. The model is the student; your document store is the textbook. You do not make the student memorize the book — you let them look up the relevant page right before they answer. Get the lookup right and even a modest model gives sharp, grounded answers. Get it wrong and the best model in the world answers from the wrong page.
The Five Stages
Every RAG pipeline, local or cloud, is the same five stages. The first three happen once, at ingest. The last two happen on every query.
| Stage | When | What happens |
|---|---|---|
| 1. Ingest | Once | Load raw documents from disk — text, markdown, PDFs, whatever your corpus is. |
| 2. Chunk | Once | Split each document into passages small enough to embed precisely, with a little overlap. |
| 3. Embed | Once | Convert each chunk into a vector with an embedding model, and store the vectors plus metadata. |
| 4. Retrieve | Per query | Embed the question, then find the k chunks whose vectors are nearest to it. |
| 5. Generate | Per query | Build a prompt from the retrieved chunks and have the LLM produce a grounded, cited answer. |
Stages 1 through 3 are your indexing pipeline — run them whenever the corpus changes. Stages 4 and 5 are your query pipeline — they run in milliseconds-to-seconds per question. The separation matters: indexing is the expensive, occasional work; querying is the cheap, frequent work. Keep them as separate code paths.
Why Fully Local
You can run any of these five stages in the cloud. The decision to run all of them locally buys four specific things.
| Property | What it means in practice |
|---|---|
| Privacy | No document, query, or embedding is ever sent off-machine. Suitable for sensitive or regulated data where cloud processing is not allowed. |
| Zero marginal cost | No per-token embedding fees, no per-token generation fees, no per-vector storage charges. You pay for electricity and the hardware you already own. |
| Offline operation | Once the models are pulled, the system needs no internet. It runs on an airgapped machine. |
| Full control | You choose the embedding model, the generation model, the chunking, and the retrieval logic, and you can audit every step. No vendor changes anything underneath you. |
The cost of local is that you run and tune the components yourself, and your throughput is bounded by your hardware rather than an elastic cloud. For a sovereignty-first studio that already runs local models, that is a cost you have mostly already paid. The privacy and zero-marginal-cost properties are the payoff.
The Stack
Three components, plus Python to wire them together. Each is open-source and runs locally.
| Component | Role | Why this one |
|---|---|---|
| Ollama | Local model runtime | Runs both the embedding model and the generation model with one simple API. Already the default local-model runtime for most self-hosted setups. |
| nomic-embed-text | Embedding model | A widely used local embedder: open-source, 768-dimensional output, 8192-token context, 137M parameters, and reported by Nomic to outperform OpenAI's text-embedding-ada-002 on standard benchmarks. |
| ChromaDB | Vector store | A local library, not a server. A PersistentClient writes SQLite plus index files to one directory. No Docker, no separate process, durable on disk. |
| A generation model | Answer generation | Any capable instruction-tuned model you run on Ollama. The heavier component; size it to your VRAM. |
Frameworks such as LangChain and LlamaIndex wrap exactly these primitives and add their own abstractions. They are useful at scale, but for learning and for a small dependency surface they hide the machinery you most need to understand. This class uses the raw Ollama client and ChromaDB directly: two dependencies, every step visible. Once you have built it this way, adopting a framework later is a convenience, not a mystery.
The 8 Architectural Decisions
Decide these eight before you write the pipeline. Each has a default that is right most of the time and a failure mode that quietly degrades answers if you choose wrong. Two of them — distance metric and context window — fail with no error at all.
Decision 1 — Distance metric: cosine or L2
The vector store ranks chunks by a distance metric. Cosine measures angle between vectors; L2 measures straight-line distance.
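Tradeoff: cosine ignores vector magnitude and yields intuitive similarity scores, which is why it is the standard for text embeddings. L2 is sensitive to magnitude, so it can rank differently whenever vectors are not unit-length. On exactly unit-normalized vectors the two metrics produce the same ordering, but the distance values differ, so any score threshold tuned for one is wrong for the other.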
Failure mode: ChromaDB defaults to L2. Leaving the default while expecting cosine produces subtly wrong rankings with no error. The space cannot be changed after the index is created, so you would have to rebuild.
Recommendation: set cosine explicitly at collection creation. This is the single most common ChromaDB mistake; fix it before you ingest a single document.
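To make the difference concrete, here is a minimal pure-Python sketch using made-up two-dimensional vectors (real embeddings have 768 dimensions). The vectors and the helper names are illustrative assumptions, not part of the reference build. It shows the two metrics disagreeing on raw vectors and agreeing once the vectors are scaled to unit length.

```python
import math

def l2_distance(a, b):
    """Straight-line (Euclidean) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.hypot(*v)
    return [x / length for x in v]

def nearest(query, candidates, metric):
    """Name of the candidate with the smallest distance to the query."""
    return min(candidates, key=lambda name: metric(query, candidates[name]))

# Toy data (assumption): one candidate shares the query's direction but is
# much longer; the other points elsewhere but sits closer in space.
query = [1.0, 0.0]
candidates = {
    "same_direction_long": [10.0, 0.0],
    "other_direction_short": [0.6, 0.8],
}

raw_l2 = nearest(query, candidates, l2_distance)        # magnitude dominates
raw_cos = nearest(query, candidates, cosine_distance)   # direction dominates

# After normalizing, both metrics pick the same neighbor.
unit = {name: normalize(v) for name, v in candidates.items()}
unit_l2 = nearest(query, unit, l2_distance)
unit_cos = nearest(query, unit, cosine_distance)
```

Whether your embedder returns unit-length vectors is worth checking rather than assuming; setting cosine explicitly removes the question.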
Decision 2 — Embedding context window
The embedder converts a chunk to a vector, but only up to its context limit.
Tradeoff: a larger window lets you embed bigger chunks whole; a smaller window forces finer chunking.
Failure mode: nomic-embed-text advertises 8192 tokens, but the default Ollama context may process only the first 2048, silently dropping the rest of a long chunk. The vector then represents only the chunk's opening, and retrieval quietly misses content that was actually there.
Recommendation: keep chunks comfortably under the effective window, or raise num_ctx so the whole chunk is embedded. Never assume the advertised limit is the active one.
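A rough pre-flight check can flag chunks that are likely to exceed the effective window before you embed them. The sketch below assumes about four characters per token, the same rule of thumb this class uses for sizing chunks; it is a heuristic, not a tokenizer, and the function names are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Crude token estimate from character count (assumption: ~4 chars/token)."""
    return int(len(text) / chars_per_token) + 1

def oversized_chunks(chunks, window_tokens=2048, chars_per_token=4.0):
    """Return the indexes of chunks whose estimate exceeds the window."""
    return [
        i for i, chunk in enumerate(chunks)
        if estimate_tokens(chunk, chars_per_token) > window_tokens
    ]

# A 1200-character chunk fits easily; a 10000-character chunk does not.
sample = ["a" * 1200, "b" * 10000]
flagged = oversized_chunks(sample)
```

If chunks are flagged, shrink them or raise the context. The Ollama Python client accepts an options mapping on embed calls, so passing something like options={"num_ctx": 8192} is one way to raise it; confirm the behavior against your installed client and model.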
Decision 3 — Chunk size
How big is each passage you embed?
Tradeoff: large chunks carry more context but blur multiple topics into one vector, hurting precision. Small chunks are precise but can lose the surrounding context that makes a passage meaningful.
Failure mode: embedding whole documents as single chunks. The vector averages everything, so retrieval returns the document for almost any query and the model drowns in irrelevant text.
Recommendation: start at a few hundred tokens per chunk. Tune from there against your actual corpus and queries. This is a knob you will return to.
Decision 4 — Chunk overlap
Do adjacent chunks share a little text at their boundaries?
Tradeoff: overlap preserves ideas that straddle a boundary, at the cost of some duplicate storage and the chance of retrieving two near-identical chunks.
Failure mode: zero overlap splitting a key sentence across two chunks, so neither chunk embeds the complete idea and neither retrieves well for it.
Recommendation: a small overlap — roughly 10 to 20 percent of chunk size — catches boundary-straddling ideas without much waste.
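The boundary failure is easy to reproduce. The sketch below uses a hypothetical windows helper (a simplified stand-in for the chunker built later in this class) and a synthetic document in which one key sentence straddles the 1200-character mark.

```python
def windows(text, size, overlap):
    """Split text into character windows that share `overlap` characters."""
    if size <= overlap:
        raise ValueError("size must be greater than overlap")
    step = size - overlap
    out, start = [], 0
    while start < len(text):
        out.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return out

# Synthetic document (assumption): filler, then a key sentence that begins
# 10 characters before the first chunk boundary, then more filler.
key = "Refunds on sale items are store credit only."
text = "a" * 1190 + key + "b" * 1000

# True only if at least one window contains the whole sentence.
zero_overlap_keeps_sentence = any(key in w for w in windows(text, 1200, 0))
overlap_keeps_sentence = any(key in w for w in windows(text, 1200, 200))
```

With zero overlap the sentence is cut in two and no chunk holds it whole; with 200 characters of overlap the second window starts early enough to contain it.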
Decision 5 — Retrieval depth (top-k)
How many chunks do you pass to the model per query?
Tradeoff: low k risks missing the chunk that holds the answer; high k floods the prompt with noise and burns context window the generation model needs.
Failure mode: setting k to one and missing the answer whenever it lives in the second-ranked chunk; or setting k to twenty and burying the relevant passage in irrelevant ones.
Recommendation: start at three to five. Raise it only if you observe answers missing content that you know is in the corpus.
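The upper bound on k is partly arithmetic: retrieved chunks, the instruction text, and the answer all share the generation model's context window. A back-of-the-envelope sketch follows; the overhead and answer-reserve figures are illustrative assumptions, and the function name is hypothetical.

```python
def max_top_k(context_tokens, chunk_tokens, overhead_tokens=200, answer_reserve=512):
    """Largest k whose chunks fit after reserving room for the prompt
    instructions (overhead_tokens) and the generated answer (answer_reserve)."""
    usable = context_tokens - overhead_tokens - answer_reserve
    return max(usable // chunk_tokens, 0)

# With ~300-token chunks:
k_for_4096 = max_top_k(4096, 300)   # (4096 - 200 - 512) // 300
k_for_2048 = max_top_k(2048, 300)   # (2048 - 200 - 512) // 300
```

This is a ceiling, not a target. The noise argument above usually keeps the useful k well below what the window allows.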
Decision 6 — Embedding model choice
Which model turns text into vectors?
Tradeoff: larger embedders can be marginally more accurate but slower and heavier; smaller ones are fast and cheap to run.
Failure mode: changing the embedding model after you have indexed. Vectors from different models are not comparable, so a query embedded with a new model cannot match chunks embedded with the old one. You must re-embed the whole corpus.
Recommendation: nomic-embed-text is the strong local default. Pick it once and re-index fully if you ever change it. Treat the embedder as part of the index's identity.
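One way to enforce "the embedder is part of the index's identity" is a small manifest file written at first ingest and checked on every later run. This is a sketch under assumptions: the manifest file, its fields, and the function name are hypothetical and not something ChromaDB or Ollama provides.

```python
import json
import os

def check_index_manifest(manifest_path, embed_model, dimensions):
    """Record the embedder on first use; raise if a later run disagrees."""
    current = {"embed_model": embed_model, "dimensions": dimensions}
    if not os.path.exists(manifest_path):
        # First run: remember which embedder built this index.
        with open(manifest_path, "w", encoding="utf-8") as fh:
            json.dump(current, fh)
        return current
    with open(manifest_path, "r", encoding="utf-8") as fh:
        recorded = json.load(fh)
    if recorded != current:
        raise RuntimeError(
            f"Index was built with {recorded}, but this run uses {current}. "
            "Re-embed the whole corpus before querying."
        )
    return recorded
```

Calling it at the top of both the ingest and query paths turns a silent mismatch into an immediate, readable error.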
Decision 7 — Chunk identity and updates
How do you handle a document changing after it is indexed?
Tradeoff: rebuilding the whole index on any change is simple but slow at scale; incremental updates are fast but need stable IDs.
Failure mode: random or sequential IDs that change on every re-ingest, so you cannot tell which chunks already exist and you accumulate duplicates.
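Recommendation: derive each chunk's ID deterministically from its source path and position, then upsert. Changed chunks overwrite in place and unchanged chunks are rewritten with identical content, so re-indexing never duplicates. Upsert does not remove chunks whose source file shrank or was deleted; those need an explicit delete.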
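Deterministic IDs also make leftover chunks easy to find: any stored ID that the current corpus no longer produces is stale. A minimal sketch, using the same hashing scheme as this class's chunk_id; the stale_ids helper is a hypothetical addition.

```python
import hashlib

def chunk_id(source, index):
    """Same scheme as the reference build: hash of source path + position."""
    raw = f"{source}::{index}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]

def stale_ids(stored_ids, current_ids):
    """IDs present in the store that the current corpus no longer produces."""
    return sorted(set(stored_ids) - set(current_ids))

# doc-a.md used to split into three chunks; after an edit it yields two.
stored = [chunk_id("doc-a.md", i) for i in range(3)]
current = [chunk_id("doc-a.md", i) for i in range(2)]
leftover = stale_ids(stored, current)
```

In ChromaDB the stored IDs should be available from collection.get() under the "ids" key, and collection.delete(ids=...) removes them; treat that wiring as an assumption to verify against your installed version.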
Decision 8 — Grounding and refusal
How hard do you constrain the model to the retrieved context?
Tradeoff: a loose prompt lets the model blend retrieved context with its training knowledge, which sometimes helps and sometimes hallucinates. A strict prompt confines it to the context and tells it to refuse when the context is insufficient.
Failure mode: a prompt that just pastes context and asks the question. The model answers confidently even when the retrieved chunks do not contain the answer, inventing details.
Recommendation: instruct the model explicitly to answer only from the provided context, to cite the source chunks, and to say it does not know when the context is insufficient. This is your strongest anti-hallucination lever.
Six of these decisions announce themselves when you get them wrong — answers are obviously off. Two do not: the L2-versus-cosine default and the embedding context truncation. Both produce a pipeline that runs cleanly and returns plausible-looking answers while quietly retrieving the wrong chunks. Set both deliberately before your first ingest and you avoid the most frustrating class of RAG bug.
Setup and Models
Two Python dependencies and two Ollama models. Nothing else. The reference build reads a folder of .txt and .md files and answers questions about them.
Project layout
sovereign-rag/
├── requirements.txt
├── corpus/ # drop your .txt and .md files here
│ ├── doc-a.md
│ └── doc-b.txt
├── chroma_store/ # created automatically; the persistent vector store
├── ingest.py # stages 1-3: ingest, chunk, embed, store
└── query.py # stages 4-5: retrieve, generate
requirements.txt
ollama>=0.4.0
chromadb>=1.0.0  # get_collection passes configuration={"hnsw": {...}}, which assumes the 1.x API
Pull the models and install
# 1. Install Ollama from https://ollama.com, then pull the two models.
# The embedder (small, fast):
ollama pull nomic-embed-text
# A generation model sized to your VRAM. llama3.1 (8B) is a solid
# default for a 12GB card. Swap for any instruct model you prefer.
ollama pull llama3.1
# 2. Confirm Ollama is serving (default port 11434):
ollama list
# 3. Install the two Python deps (a venv is recommended):
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
Both scripts share the same embedding model name, generation model name, ChromaDB path, and collection name. Keep them identical across ingest and query — if the embedder differs between indexing and querying, the vectors are not comparable and retrieval silently fails. The scripts below define these as constants at the top for exactly this reason.
Ingest and Chunk
Stage 1 loads files; stage 2 splits them into overlapping chunks. The chunker here is deliberately simple and dependency-free: it splits on character count with overlap. Token-aware splitting is more precise, but character chunking is transparent and good enough to learn on — and it has no extra dependency.
import os
import hashlib

CHUNK_SIZE = 1200     # characters, not tokens (roughly 300 tokens)
CHUNK_OVERLAP = 200   # ~15% overlap to catch boundary-straddling ideas


def load_corpus(corpus_dir: str) -> list[dict]:
    """Stage 1: load every .txt and .md file in the corpus directory."""
    docs = []
    for root, _dirs, files in os.walk(corpus_dir):
        for fname in files:
            if not fname.lower().endswith((".txt", ".md")):
                continue
            path = os.path.join(root, fname)
            with open(path, "r", encoding="utf-8") as fh:
                text = fh.read()
            docs.append({"source": os.path.relpath(path, corpus_dir), "text": text})
    return docs


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Stage 2: split one document into overlapping character windows."""
    if size <= overlap:
        raise ValueError("CHUNK_SIZE must be greater than CHUNK_OVERLAP")
    chunks = []
    start = 0
    text = text.strip()
    while start < len(text):
        end = start + size
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            # this window reached the end; stopping here avoids a redundant
            # tail chunk that is fully contained in the previous one
            break
        # advance by size minus overlap so windows overlap
        start += size - overlap
    return chunks


def chunk_id(source: str, index: int) -> str:
    """Deterministic ID: same source + position always yields the same ID,
    so re-ingest upserts in place instead of duplicating (Decision 7)."""
    raw = f"{source}::{index}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]
Note the deterministic chunk_id. Because the ID is a hash of source path plus position, re-ingesting an unchanged file produces the same IDs, so ChromaDB's upsert overwrites in place rather than accumulating duplicates. Change a file and only its chunks change.
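The reference build re-embeds every chunk on each run, which is simple and safe. If ingest time becomes a problem, one possible refinement is to record a content hash per chunk and embed only the chunks whose hash changed. The sketch below is an assumption layered on top of the build: the stored_hashes mapping (for example, kept in chunk metadata) and both function names are hypothetical.

```python
import hashlib

def content_hash(text):
    """Fingerprint of a chunk's text; changes whenever the text changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_needing_embedding(chunks_by_id, stored_hashes):
    """Return IDs that are new or whose text differs from the last ingest.

    chunks_by_id:  {chunk_id: chunk_text} for the current corpus
    stored_hashes: {chunk_id: content_hash} recorded at the last ingest
    """
    return [
        cid for cid, text in chunks_by_id.items()
        if stored_hashes.get(cid) != content_hash(text)
    ]

# "a" is unchanged, "b" was edited, "c" is new.
current_chunks = {"a": "hello", "b": "world", "c": "fresh"}
last_ingest = {"a": content_hash("hello"), "b": content_hash("old text")}
to_embed = chunks_needing_embedding(current_chunks, last_ingest)
```

The trade is a little bookkeeping in exchange for skipping embedding calls on unchanged text.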
Embed and Store
Stage 3 embeds each chunk with nomic-embed-text via the Ollama client and stores the vector, the text, and the source metadata in ChromaDB. This is where the two silent-killer decisions get fixed in code: cosine distance is set at collection creation, and chunks are sized under the effective embedding window.
import chromadb
import ollama

CHROMA_PATH = "./chroma_store"
COLLECTION = "sovereign_corpus"
EMBED_MODEL = "nomic-embed-text"


def get_collection():
    """Open the persistent store and get-or-create the collection with
    cosine distance set explicitly (Decision 1: the L2 default footgun).
    The space cannot be changed after creation, so it is set here once."""
    client = chromadb.PersistentClient(path=CHROMA_PATH)
    return client.get_or_create_collection(
        name=COLLECTION,
        configuration={"hnsw": {"space": "cosine"}},
    )


def embed_one(text: str) -> list[float]:
    """Stage 3: embed a single chunk with nomic-embed-text on Ollama.
    Returns a 768-dimension vector."""
    resp = ollama.embed(model=EMBED_MODEL, input=text)
    # ollama.embed returns {"embeddings": [[...]]}, one row per input
    return resp["embeddings"][0]


def index_corpus(corpus_dir: str) -> int:
    """Run stages 1-3 end to end and upsert into ChromaDB.
    Returns the number of chunks indexed."""
    collection = get_collection()
    docs = load_corpus(corpus_dir)
    ids, embeddings, documents, metadatas = [], [], [], []
    for doc in docs:
        chunks = chunk_text(doc["text"])
        for i, chunk in enumerate(chunks):
            ids.append(chunk_id(doc["source"], i))
            embeddings.append(embed_one(chunk))
            documents.append(chunk)
            metadatas.append({"source": doc["source"], "chunk": i})
    if not ids:
        print("No chunks found. Add .txt or .md files to the corpus directory.")
        return 0
    # upsert so re-runs overwrite by deterministic ID instead of duplicating
    collection.upsert(
        ids=ids,
        embeddings=embeddings,
        documents=documents,
        metadatas=metadatas,
    )
    return len(ids)
ChromaDB can call an embedding function for you, but here we embed with the Ollama client ourselves and pass the vectors in directly. This keeps the embedder identical across ingest and query, makes the 768-dimension flow visible, and avoids depending on a ChromaDB-Ollama integration shim. The query script embeds the question the exact same way — same model, same client — so the vectors are always comparable.
Retrieve and Generate
Stages 4 and 5 run on every query. Stage 4 embeds the question with the same model used at ingest, then asks ChromaDB for the nearest chunks. Stage 5 builds a grounded prompt from those chunks and has the local LLM answer.
GEN_MODEL = "llama3.1"
TOP_K = 4   # Decision 5: retrieval depth


def retrieve(question: str, k: int = TOP_K) -> list[dict]:
    """Stage 4: embed the question the SAME way as ingest, then fetch
    the k nearest chunks by cosine distance."""
    collection = get_collection()
    q_vec = embed_one(question)   # same embedder, same client
    res = collection.query(
        query_embeddings=[q_vec],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    # ChromaDB returns parallel lists wrapped one level deep (per query)
    hits = []
    for doc, meta, dist in zip(
        res["documents"][0], res["metadatas"][0], res["distances"][0]
    ):
        hits.append({"text": doc, "source": meta["source"],
                     "chunk": meta["chunk"], "distance": dist})
    return hits


def build_prompt(question: str, hits: list[dict]) -> str:
    """Stage 5a: assemble a grounded prompt. Each chunk is labeled with a
    source tag so the model can cite it (Decision 8: grounding + refusal)."""
    context_blocks = []
    for n, h in enumerate(hits, start=1):
        tag = f"[{n}] {h['source']} (chunk {h['chunk']})"
        context_blocks.append(f"{tag}\n{h['text']}")
    context = "\n\n".join(context_blocks)
    return (
        "You are a careful assistant. Answer the question using ONLY the "
        "context below. Cite the sources you used by their bracket number, "
        "e.g. [1]. If the context does not contain the answer, say you do "
        "not know rather than guessing.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n\n"
        "ANSWER (with citations):"
    )


def answer(question: str, k: int = TOP_K) -> dict:
    """Stages 4-5: retrieve, ground, and generate locally. Returns the
    answer text plus the sources it was grounded in."""
    hits = retrieve(question, k)
    if not hits:
        return {"answer": "The knowledge base is empty. Run ingest first.",
                "sources": []}
    prompt = build_prompt(question, hits)
    resp = ollama.chat(
        model=GEN_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return {
        "answer": resp["message"]["content"],
        "sources": [{"n": i + 1, "source": h["source"], "chunk": h["chunk"]}
                    for i, h in enumerate(hits)],
    }
Three details earn their place here. The query is embedded with embed_one — the exact same function ingest used — so the vectors are comparable. The prompt names each chunk with a bracket tag, giving the model something concrete to cite. And the instruction tells the model to refuse when the context is insufficient, which is the difference between a grounded answer and a confident hallucination.
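The retrieve function already returns a distance for every hit, and that number can back up the refusal instruction in code: if nothing retrieved is close enough, skip generation and say so. The sketch below is an optional addition, and the 0.6 cutoff is an arbitrary assumption you would calibrate on your own corpus.

```python
def filter_hits(hits, max_distance=0.6):
    """Keep only hits at or under the cosine-distance cutoff."""
    return [h for h in hits if h["distance"] <= max_distance]

def should_refuse(hits, max_distance=0.6):
    """True when no retrieved chunk is close enough to ground an answer."""
    return not filter_hits(hits, max_distance)

# Hits shaped like the dicts returned by retrieve() (values are made up).
sample_hits = [
    {"text": "...", "source": "doc-a.md", "chunk": 0, "distance": 0.21},
    {"text": "...", "source": "doc-a.md", "chunk": 1, "distance": 0.58},
    {"text": "...", "source": "doc-b.txt", "chunk": 4, "distance": 0.93},
]
kept = filter_hits(sample_hits)
```

Cosine distance here is one minus cosine similarity, so smaller means closer. A cutoff that is too strict causes refusals on answerable questions, which is another reason to tune it against a known-answer set.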
The Complete Scripts
Assembled into the two runnable files. Everything above, in order, with a command-line entry point on each. Copy these verbatim into ingest.py and query.py.
ingest.py
import os
import sys
import hashlib
import chromadb
import ollama

# ---- shared config (must match query.py) -------------------------------
CHROMA_PATH = "./chroma_store"
COLLECTION = "sovereign_corpus"
EMBED_MODEL = "nomic-embed-text"
CHUNK_SIZE = 1200     # characters (~300 tokens)
CHUNK_OVERLAP = 200   # ~15% overlap


def load_corpus(corpus_dir: str) -> list[dict]:
    docs = []
    for root, _dirs, files in os.walk(corpus_dir):
        for fname in files:
            if not fname.lower().endswith((".txt", ".md")):
                continue
            path = os.path.join(root, fname)
            with open(path, "r", encoding="utf-8") as fh:
                text = fh.read()
            docs.append({"source": os.path.relpath(path, corpus_dir), "text": text})
    return docs


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    if size <= overlap:
        raise ValueError("CHUNK_SIZE must be greater than CHUNK_OVERLAP")
    chunks, start, text = [], 0, text.strip()
    while start < len(text):
        chunk = text[start:start + size].strip()
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break  # final window reached the end; skip the redundant tail
        start += size - overlap
    return chunks


def chunk_id(source: str, index: int) -> str:
    return hashlib.sha1(f"{source}::{index}".encode("utf-8")).hexdigest()[:16]


def get_collection():
    client = chromadb.PersistentClient(path=CHROMA_PATH)
    return client.get_or_create_collection(
        name=COLLECTION,
        configuration={"hnsw": {"space": "cosine"}},
    )


def embed_one(text: str) -> list[float]:
    return ollama.embed(model=EMBED_MODEL, input=text)["embeddings"][0]


def index_corpus(corpus_dir: str) -> int:
    collection = get_collection()
    ids, embeddings, documents, metadatas = [], [], [], []
    for doc in load_corpus(corpus_dir):
        for i, chunk in enumerate(chunk_text(doc["text"])):
            ids.append(chunk_id(doc["source"], i))
            embeddings.append(embed_one(chunk))
            documents.append(chunk)
            metadatas.append({"source": doc["source"], "chunk": i})
    if not ids:
        print("No chunks found. Add .txt or .md files to the corpus directory.")
        return 0
    collection.upsert(ids=ids, embeddings=embeddings,
                      documents=documents, metadatas=metadatas)
    return len(ids)


if __name__ == "__main__":
    corpus_dir = sys.argv[1] if len(sys.argv) > 1 else "./corpus"
    n = index_corpus(corpus_dir)
    print(f"Indexed {n} chunks from {corpus_dir} into {COLLECTION}.")
query.py
import sys
import chromadb
import ollama

# ---- shared config (must match ingest.py) ------------------------------
CHROMA_PATH = "./chroma_store"
COLLECTION = "sovereign_corpus"
EMBED_MODEL = "nomic-embed-text"
GEN_MODEL = "llama3.1"
TOP_K = 4


def get_collection():
    client = chromadb.PersistentClient(path=CHROMA_PATH)
    return client.get_or_create_collection(
        name=COLLECTION,
        configuration={"hnsw": {"space": "cosine"}},
    )


def embed_one(text: str) -> list[float]:
    return ollama.embed(model=EMBED_MODEL, input=text)["embeddings"][0]


def retrieve(question: str, k: int = TOP_K) -> list[dict]:
    collection = get_collection()
    res = collection.query(
        query_embeddings=[embed_one(question)],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    hits = []
    for doc, meta, dist in zip(
        res["documents"][0], res["metadatas"][0], res["distances"][0]
    ):
        hits.append({"text": doc, "source": meta["source"],
                     "chunk": meta["chunk"], "distance": dist})
    return hits


def build_prompt(question: str, hits: list[dict]) -> str:
    blocks = []
    for n, h in enumerate(hits, start=1):
        blocks.append(f"[{n}] {h['source']} (chunk {h['chunk']})\n{h['text']}")
    context = "\n\n".join(blocks)
    return (
        "You are a careful assistant. Answer the question using ONLY the "
        "context below. Cite the sources you used by their bracket number, "
        "e.g. [1]. If the context does not contain the answer, say you do "
        "not know rather than guessing.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n\n"
        "ANSWER (with citations):"
    )


def answer(question: str, k: int = TOP_K) -> dict:
    hits = retrieve(question, k)
    if not hits:
        return {"answer": "The knowledge base is empty. Run ingest.py first.",
                "sources": []}
    resp = ollama.chat(
        model=GEN_MODEL,
        messages=[{"role": "user", "content": build_prompt(question, hits)}],
    )
    return {
        "answer": resp["message"]["content"],
        "sources": [{"n": i + 1, "source": h["source"], "chunk": h["chunk"]}
                    for i, h in enumerate(hits)],
    }


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print('Usage: python query.py "your question here"')
        sys.exit(1)
    result = answer(" ".join(sys.argv[1:]))
    print(result["answer"])
    print("\nSources:")
    for s in result["sources"]:
        print(f"  [{s['n']}] {s['source']} (chunk {s['chunk']})")
Run It
Three commands from a cold start to a grounded, cited answer.
# 1. Index your corpus (stages 1-3). Drop files in ./corpus first.
python ingest.py ./corpus
# → "Indexed 142 chunks from ./corpus into sovereign_corpus."
# 2. Ask a question (stages 4-5).
python query.py "What is our return policy for sale items?"
# → a grounded answer, followed by the source chunks it used.
# 3. Re-run ingest any time the corpus changes. Deterministic IDs mean
# unchanged chunks are overwritten in place, not duplicated.
python ingest.py ./corpus
The first ingest is the slow step — every chunk is embedded once. Queries are fast: one embedding call for the question, a millisecond vector search, and one generation call. Nothing in this loop touches the network.
Pull the models while online, then disconnect the network entirely and run both commands again. They work unchanged. That is the proof of sovereignty: once nomic-embed-text and your generation model are on disk, the entire pipeline — embed, store, retrieve, generate — runs with no internet and no external service. Your documents never leave the machine.
Grounding and Citations
The single biggest reliability lever in RAG is the prompt's grounding instruction. The build above already does three things that most first attempts miss.
It tags every chunk. Each retrieved chunk enters the prompt with a bracket number and its source path. The model is asked to cite by bracket number, so the answer carries provenance you can verify. A claim cited [2] can be traced straight back to a specific chunk of a specific file.
It constrains to the context. The instruction says to answer using only the provided context. This stops the model from blending in half-remembered training knowledge, which is where most RAG hallucinations originate — not from the retrieved text, but from the model reaching past it.
It permits refusal. The model is told to say it does not know when the context is insufficient. Counterintuitively, a system that sometimes says "I don't know" is more trustworthy than one that always answers, because its answers mean something. Refusal is a feature.
In a sovereign system there is no external service to blame and no support ticket to file when an answer is wrong. You own the whole stack. Citations turn every answer into something auditable: you can open the cited chunk, confirm the claim, and if the retrieval was wrong, you know exactly which knob — chunk size, top-k, distance metric — to turn. Provenance is how you debug a system you fully own.
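Because citations are plain bracket numbers, they can be checked mechanically. The sketch below parses the numbers out of an answer and compares them with the sources list that answer() returns; the helper names and the sample answer text are illustrative assumptions.

```python
import re

def cited_numbers(answer_text):
    """Bracket numbers such as [2] that appear in the answer, deduplicated."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer_text)})

def check_citations(answer_text, sources):
    """Compare cited numbers against the sources that were actually supplied."""
    valid = {s["n"] for s in sources}
    cited = cited_numbers(answer_text)
    return {
        "cited": cited,
        "unknown": [n for n in cited if n not in valid],  # cited but never supplied
        "has_citation": bool(cited),
    }

# Sources shaped like answer()["sources"]; the answer text is made up.
sources = [{"n": i, "source": "doc-a.md", "chunk": i - 1} for i in range(1, 5)]
report = check_citations("Sale items are final [2]. See also [5].", sources)
```

An answer with no citation at all, or one citing a number that was never in the prompt, is a cheap signal that the model reached past the context.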
Tuning and Gotchas
Once the pipeline runs, answer quality lives in a small number of knobs. Turn these, in this order.
Fix the two silent killers first
Before tuning anything, confirm both invisible defaults are handled. Cosine distance must be set at collection creation — the build does this in get_collection, and because the space cannot change after creation, a store built on the L2 default must be deleted and rebuilt. Embedding truncation: with a 1200-character chunk you are well under the 2048-token effective window, so nomic-embed-text sees the whole chunk. If you raise CHUNK_SIZE substantially, either keep it under that window or raise the model's context so chunks are not silently clipped.
Then tune retrieval
| Symptom | Knob | Direction |
|---|---|---|
| Answers miss content you know is in the corpus | TOP_K | Raise it — the answer chunk may rank just outside the current k. |
| Answers are vague or wander off-topic | TOP_K | Lower it — too many chunks dilute the relevant passage with noise. |
| Retrieval returns whole-topic blur, low precision | CHUNK_SIZE | Lower it — large chunks average too many ideas into one vector. |
| Retrieved chunks cut off mid-thought | CHUNK_OVERLAP | Raise it — more overlap preserves ideas that straddle boundaries. |
| Answers correct but model adds outside facts | Prompt | Tighten the grounding instruction; emphasize "only from the context." |
Measuring retrieval, not just vibes
Tune against evidence. Assemble a small set of questions whose answers you know live in specific chunks. After each change, run them and check whether the right chunk appears in the retrieved set and whether the answer cites it. This turns tuning from guesswork into a measurable loop — exactly the proof-over-vibes discipline a sovereign system rewards, because you are the only one who can validate it.
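The known-answer loop described above fits in a few lines. This sketch computes a hit rate: the fraction of test questions for which the expected chunk appears in the retrieved set. It takes the retrieval function as a parameter so it can wrap this class's retrieve(question, k); the test-set format and the stand-in retriever are assumptions for illustration.

```python
def hit_rate(test_set, retrieve_fn, k=4):
    """Fraction of questions whose expected (source, chunk) is among the
    top-k hits returned by retrieve_fn(question, k)."""
    if not test_set:
        return 0.0
    found = 0
    for case in test_set:
        hits = retrieve_fn(case["question"], k)
        if any(h["source"] == case["source"] and h["chunk"] == case["chunk"]
               for h in hits):
            found += 1
    return found / len(test_set)

# Stand-in retriever with canned results, used here instead of a live store.
def fake_retrieve(question, k):
    canned = {
        "q1": [{"source": "doc-a.md", "chunk": 0}],
        "q2": [{"source": "doc-b.txt", "chunk": 3}],
    }
    return canned.get(question, [])[:k]

cases = [
    {"question": "q1", "source": "doc-a.md", "chunk": 0},   # expected chunk found
    {"question": "q2", "source": "doc-a.md", "chunk": 1},   # expected chunk missed
]
score = hit_rate(cases, fake_retrieve)
```

Record the score before and after each change to CHUNK_SIZE, CHUNK_OVERLAP, or TOP_K. Keep in mind that changing the chunking shifts chunk positions, so the expected chunk numbers in the test set need refreshing when you do.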
Failure Modes
| Symptom | Cause | Fix |
|---|---|---|
| Plausible answers, consistently wrong chunks | Collection left on L2 distance instead of cosine | Recreate the collection with cosine; the space cannot be changed in place. |
| Long documents retrieve poorly | Chunks exceed the effective embedding window and are silently truncated | Keep chunks under the window or raise num_ctx so the full chunk embeds. |
| Query embedding fails to match anything | Question embedded with a different model than the corpus | Use one EMBED_MODEL constant shared by ingest and query — never diverge. |
| Duplicate chunks pile up on re-ingest | Non-deterministic IDs, so upsert cannot match existing chunks | Derive IDs deterministically from source + position, then upsert. |
| Model invents facts not in the corpus | Prompt does not constrain the model to the retrieved context | Instruct "answer only from context," require citations, allow refusal. |
| Connection refused on embed or chat call | Ollama is not running or not serving on the expected port | Start Ollama and confirm with ollama list; default endpoint is port 11434. |
The Sovereignty Case
It is worth being explicit about why this architecture is worth the tuning effort, because the cloud alternative is genuinely easier to stand up.
A hosted RAG service gives you managed embeddings, a managed vector database, and elastic scale in an afternoon. In exchange, your documents are processed on someone else's hardware, every embedding and every query is billed, and the system stops working when the network does. For a great many projects that trade is fine.
For a sovereignty-first studio, it is not. When the corpus is your own internal knowledge, your clients' material, or anything you are not willing to send to a third party, local is not a preference — it is the constraint. The same decision that protects the data also removes the per-token meter and the connectivity dependency. You run capable models on hardware you already own, index once, and query forever at the cost of electricity.
The first local RAG build is the expensive one — you learn the five stages, the two silent defaults, and the tuning loop all at once. Every build after reuses the entire scaffold: same two scripts, same shared-config discipline, same cosine-and-context fixes, same deterministic-ID upsert. You only swap the corpus and retune chunk size and top-k. The pipeline becomes infrastructure, and a private, offline, zero-marginal-cost knowledge base becomes something you can stand up for any document set in an afternoon — on hardware that is entirely yours.
Closing Notes
You now have a complete, runnable, fully local RAG system and the judgment to tune it. The compressed playbook for your next knowledge base:
- Set cosine before you ingest. ChromaDB defaults to L2, the space cannot change after creation, and the failure is silent. This is the first line of get_collection for a reason.
- Keep chunks under the embedding window. nomic-embed-text may truncate long chunks with no error. Size chunks deliberately or raise the context.
- Share one config across ingest and query. Same embedding model, same path, same collection. If the embedder diverges, vectors stop being comparable and retrieval quietly fails.
- Use deterministic chunk IDs and upsert. Re-indexing becomes incremental instead of a duplicate-accumulating full rebuild.
- Tune chunk size and top-k against a known-answer test set. These two knobs move quality more than the model does. Measure, do not guess.
- Ground the prompt and allow refusal. Answer only from context, cite by bracket number, say "I don't know" when the context is thin. This is your anti-hallucination lever.
- Prove it offline. Disconnect the network and run it. If it still answers, you have genuine data sovereignty — not a cloud dependency in disguise.
A local RAG pipeline is a building block, not an endpoint. The same retrieve-and-ground pattern feeds an agent that needs grounded context, a local assistant over your own notes, or a private knowledge layer behind a larger orchestrator. You have built the sovereign retrieval primitive; everything downstream — agents, tools, interfaces — composes on top of it. The corpus is yours, the models are yours, the machine is yours. Nothing leaves.
Frequently Asked Questions
Eighteen questions builders ask most. These mirror the FAQPage schema at the top of the page, which surfaces them in AI overviews and rich results.
What is retrieval-augmented generation (RAG)?
RAG is a technique that enhances a language model by giving it access to external knowledge at query time. Instead of relying only on what the model learned in training, the system retrieves relevant chunks from your own documents and feeds them to the model as context. The result is an answer grounded in your specific data rather than the model's general training.
What does it mean for RAG to be fully local?
Every component runs on your own machine: the embedding model, the vector database, and the language model. No documents, queries, or embeddings are sent to any cloud service. There are no API keys and no per-token costs, and the system works with no internet connection once the models are downloaded. Data sovereignty is complete.
What are the five stages of a RAG pipeline?
Ingest: load your raw documents. Chunk: split them into passages small enough to embed well. Embed: convert each chunk into a vector with an embedding model. Retrieve: embed the query and find the nearest chunks by vector similarity. Generate: build a prompt from the retrieved chunks and have the language model produce a grounded answer.
Which embedding model should I use locally?
nomic-embed-text is one of the most widely used local embedding models on Ollama. It is fully open-source, has an 8192-token context length, runs fast at 137 million parameters, produces 768-dimensional vectors, and integrates cleanly with ChromaDB, LangChain, and LlamaIndex. Nomic's published benchmarks report it outperforming OpenAI's text-embedding-ada-002 and text-embedding-3-small.
What is the nomic-embed-text context truncation gotcha?
Although nomic-embed-text supports up to 8192 tokens, the default context window in Ollama may only process the first 2048 tokens, silently truncating longer chunks. This degrades embedding quality on long passages without any error. The fix is to keep your chunks comfortably under the effective window, or raise the num_ctx parameter so the full chunk is embedded.
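A cheap guard is to flag chunks that probably exceed the window before embedding them. The sketch below assumes roughly four characters per token, which is a crude heuristic and not the model's real tokenizer. The helper names are hypothetical. The payload function shows the shape of a request body for Ollama's /api/embed endpoint with num_ctx passed under options; whether a given Ollama version honors it for this model is worth confirming on your own install.

```python
import json

EFFECTIVE_WINDOW = 2048   # tokens Ollama may process by default, per the text above
CHARS_PER_TOKEN = 4       # assumption: rough average, not the real tokenizer

def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def oversized(chunks: list[str], window: int = EFFECTIVE_WINDOW) -> list[int]:
    """Indexes of chunks whose estimated token count exceeds the window."""
    return [i for i, chunk in enumerate(chunks) if estimate_tokens(chunk) > window]

def embed_payload(chunk: str, num_ctx: int = 8192) -> str:
    """JSON body for POST /api/embed with a raised context window."""
    return json.dumps({
        "model": "nomic-embed-text",
        "input": chunk,
        "options": {"num_ctx": num_ctx},
    })

chunks = ["short passage", "x" * 10_000]
print(oversized(chunks))  # [1]
```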
Why ChromaDB for the vector store?
ChromaDB runs as a local library with no server, no Docker, and no separate process. A PersistentClient points at a directory and stores everything as SQLite plus binary index files. Each collection bundles the documents, an HNSW index over their embeddings, the metadata, and the embedding function. It is the simplest path to a durable local vector store.
What is the ChromaDB cosine distance gotcha?
ChromaDB defaults to L2 (squared Euclidean) distance, not cosine. For normalized text embeddings you almost always want cosine, so you must set it explicitly when creating the collection. The space cannot be changed after the index is created, so getting it right at creation time matters. Two of the most common ChromaDB bug reports trace directly to this default.
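A toy example shows how the two metrics can disagree. For unit-length vectors, squared L2 and cosine produce the same ranking (squared L2 equals 2 minus 2 times the cosine similarity), so the divergence below comes from vectors of different magnitude. The distance values differ either way, which matters for any threshold you apply. The vectors and names are made up for illustration.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance, the quantity ChromaDB's l2 space reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus cosine similarity; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

query = [1.0, 0.0]
docs = {
    "same_direction_long": [10.0, 0.0],  # identical direction, large magnitude
    "off_axis_short": [1.0, 1.0],        # 45 degrees away, small magnitude
}

nearest_l2 = min(docs, key=lambda name: squared_l2(query, docs[name]))
nearest_cos = min(docs, key=lambda name: cosine_distance(query, docs[name]))
print(nearest_l2)   # off_axis_short
print(nearest_cos)  # same_direction_long
```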
How do I set cosine distance in ChromaDB?
Pass a configuration with the HNSW space set to cosine when you create the collection. The modern form is a configuration dictionary with an hnsw block that sets space to cosine. The older form, a metadata dictionary with the key hnsw:space, still works for backwards compatibility but is deprecated in current ChromaDB. Set it at creation; it cannot be changed afterward.
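The two forms look like this in Python. This is a sketch, assuming a recent ChromaDB release that accepts the configuration argument; `cosine_kwargs` and `open_collection` are hypothetical helpers, and the path and collection name are placeholders. The chromadb import sits inside the function so the keyword-building part runs even where the library is not installed.

```python
def cosine_kwargs(modern: bool = True) -> dict:
    """Keyword arguments that select cosine distance at collection creation."""
    if modern:
        # Current form: configuration dict with an hnsw block.
        return {"configuration": {"hnsw": {"space": "cosine"}}}
    # Older form: metadata dict with the hnsw:space key.
    return {"metadata": {"hnsw:space": "cosine"}}

def open_collection(path: str = "./chroma_db", name: str = "docs",
                    modern: bool = True):
    """Open or create a persistent collection with cosine distance."""
    import chromadb  # lazy import; requires the chromadb package

    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(name=name, **cosine_kwargs(modern))

print(cosine_kwargs())
print(cosine_kwargs(modern=False))
```

If the collection already exists with L2, these arguments will not convert it; delete and recreate it, then re-ingest.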
What is chunking and why does it matter?
Chunking splits documents into passages before embedding. Chunk size is a tradeoff: too large and a single vector blurs many topics together, hurting retrieval precision; too small and you lose the surrounding context that makes a passage meaningful. A common starting point is several hundred tokens per chunk with a small overlap so ideas that straddle a boundary are not lost.
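A sliding window with overlap is the simplest version of this. The sketch below counts whitespace-separated words as a stand-in for tokens, which is an assumption; real token counts depend on the embedding model's tokenizer. `chunk_words` and its defaults are illustrative, not the class's exact implementation.

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words
    with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each window repeats the last word of the previous one here, so a sentence that straddles a boundary appears whole in at least one chunk once the overlap is wide enough to cover it.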
What is top-k retrieval?
After embedding the query, the system finds the k chunks whose vectors are nearest to it and passes those to the language model as context. k is a tuning knob: too low and you may miss the chunk that holds the answer; too high and you flood the prompt with noise and waste the context window. Three to five chunks is a common starting range.
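Stripped of the index, top-k is a sort by similarity that keeps the first k entries. The brute-force sketch below uses made-up two-dimensional vectors and a hypothetical `top_k` helper; a vector store such as ChromaDB does the same job with an approximate HNSW index instead of scanning every vector.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index: dict[str, list[float]], k: int = 4):
    """Return the k (chunk_id, similarity) pairs most similar to the query,
    best first."""
    scored = ((cid, cosine_similarity(query_vec, vec)) for cid, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

index = {
    "a": [1.0, 0.0],
    "b": [0.7, 0.7],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
print([cid for cid, _ in top_k([1.0, 0.1], index, k=2)])  # ['a', 'b']
```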
Do I need LangChain or LlamaIndex to build local RAG?
No. They are convenience frameworks that wrap the same primitives. You can build a complete, production-grade local RAG pipeline with just the Ollama client and ChromaDB, which keeps the dependency surface small and every step visible. This class builds on the raw primitives first, so you understand what any framework is doing for you.
How much hardware do I need to run local RAG?
Embedding with nomic-embed-text is light at 137 million parameters and runs comfortably on a modest GPU or even CPU. The generation model is the heavier component; a consumer GPU with 8 to 12 gigabytes of VRAM runs capable mid-sized models well. The vector store itself is trivial, since ChromaDB is just SQLite plus index files on disk.
How do I keep the answer grounded and reduce hallucination?
Build the prompt so the model is instructed to answer only from the retrieved context and to say when the context does not contain the answer. Include the source chunk identifiers in the prompt so the model can cite them. Grounding plus an explicit instruction to refuse when context is insufficient is the most effective lever against hallucination in a RAG system.
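A prompt builder along these lines covers all three instructions: answer only from context, cite by bracket number, and refuse when the context is thin. The wording, the `build_prompt` name, and the chunk dictionary keys (source, text) are assumptions for illustration; adjust them to match your own retrieval output.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt with numbered, source-tagged passages."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "Cite the bracket number of every passage you rely on, like [1].\n"
        "If the context does not contain the answer, reply \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

retrieved = [
    {"source": "policy.md", "text": "Refunds are accepted within 30 days."},
    {"source": "faq.md", "text": "Store credit never expires."},
]
print(build_prompt("What is the refund window?", retrieved))
```

Because the bracket numbers map back to positions in the retrieved list, the caller can return the answer together with the sources it was grounded in.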
Can I add citations to RAG answers?
Yes. Store a source identifier in each chunk's metadata at ingest time. When you retrieve the top-k chunks, you have their metadata, so you can pass the identifiers into the prompt and ask the model to attribute each claim. The class shows a pattern that returns the answer alongside the list of source chunks it was grounded in.
How do I update the knowledge base when documents change?
Give each chunk a stable, deterministic ID derived from its source and position. On re-ingest, upsert by that ID so changed chunks overwrite their previous version and unchanged chunks map back onto their existing entries instead of piling up as duplicates. This makes re-indexing incremental rather than a full rebuild, which matters once the corpus is large.
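One possible ID scheme hashes the source path and the chunk's position. The `chunk_id` helper and the 16-character truncation are illustrative choices, not a requirement of ChromaDB; any ID that is stable across runs works.

```python
import hashlib

def chunk_id(source: str, position: int) -> str:
    """Stable ID from source path and chunk index; same inputs, same ID."""
    key = f"{source}\x00{position}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]

ids = [chunk_id("docs/policy.md", i) for i in range(3)]
print(ids)

# With a ChromaDB collection in hand, the IDs feed straight into upsert, e.g.:
# collection.upsert(ids=ids, documents=texts, embeddings=vectors, metadatas=metas)
```

Upsert does not remove chunks that no longer exist, for example when a document shrinks, so stale IDs past the new end of a document need a separate delete.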
Is local RAG private enough for sensitive data?
Because nothing leaves the machine, local RAG is well suited to sensitive or regulated data where cloud processing is not acceptable. Privacy is one of the main reasons people choose self-hosted pipelines. The remaining responsibility is ordinary local security: who can access the machine, the vector store directory, and the source documents.
How does local RAG compare to a cloud RAG service?
Cloud RAG services are faster to stand up and scale elastically, but they send your data off-machine, charge per token and per stored vector, and depend on connectivity. Local RAG trades that convenience for complete data control, zero marginal cost, and offline operation, at the cost of running and tuning the components yourself. For sovereignty-first work, local wins.
What can break a local RAG pipeline?
The usual failures are a distance-metric mismatch from leaving ChromaDB on its L2 default, silent embedding truncation when chunks exceed Ollama's default context window for nomic-embed-text, chunks that are too large to retrieve precisely, a retrieval depth that is too low to surface the answer, and a prompt that does not constrain the model to the retrieved context. Each has a clear fix covered in the class.
