02 March 2026·13 min

Retrieval Is a Product Problem: Building a RAG Agent That a Small Ops Team Actually Uses

How I built a Retrieval-Augmented Generation knowledge agent on Claude and ChromaDB that ingests messy organisational documents, answers cross-document questions with inline citations, and keeps session memory without poisoning new conversations.

Image: an open wooden library card catalogue drawer with rows of yellowed index cards. Caption: A retrieval system is just a card catalogue that knows what you meant.

The original problem was not glamorous. A small operations team had years of knowledge buried across PDFs, email threads exported to text, and meeting notes in shared docs. When somebody new joined, the answer to most of their questions was 'ask Sara, she was on that account in 2022'. When Sara was on leave, the answer was 'wait until Sara is back'.

I wanted to build something where a new joiner could type a question in plain English and get a real answer, with citations, in a few seconds. Not a chatbot pretending to know things. A retrieval system that admitted what it had read and what it had not.

Why the first version was bad

The first version did the obvious thing: split every document into 500-token chunks with a 50-token overlap, embed them, retrieve the top five for each query, stuff them into a Claude prompt, and ask for an answer. It worked on easy questions and lied confidently on hard ones.

The reason was almost always the same. Naive fixed-length chunking cut sentences in half and split tables across chunks, so the retriever was being asked to match a query against fragments that no human would consider a coherent unit of meaning. The embeddings were doing their job. The chunks were the bug.
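To make the failure concrete, here is a minimal sketch of fixed-length chunking with overlap. It is an illustration, not the code from the first version: `fixed_chunks` is a hypothetical name, whitespace-separated words stand in for real tokens, and the tiny `size` and `overlap` values are chosen only so the effect is visible on two sentences.

```python
def fixed_chunks(tokens, size=500, overlap=50):
    """Naive fixed-length chunking: a window of `size` tokens, advanced by size - overlap."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]


# Whitespace-separated words stand in for model tokens in this illustration.
text = "Cancellation requires thirty days notice. Refunds are issued within ten days."
for chunk in fixed_chunks(text.split(), size=6, overlap=1):
    print(" ".join(chunk))
```

The first chunk ends on the word "Refunds", one word into the second sentence, and the last chunk is a one-word fragment. Neither is a unit a human would file under a single heading.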

Semantic chunking, simply

I replaced fixed-length chunking with a small heuristic that respects document structure first and length second.

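def semantic_chunks(text, max_tokens=800, min_tokens=120):
    """Chunk by document structure first and by length second."""
    # split_on_headings and token_count are helpers defined elsewhere in the pipeline
    sections = split_on_headings(text)              # h1/h2/h3, bullet groups, table blocks
    chunks = []
    for section in sections:
        if token_count(section) <= max_tokens:
            chunks.append(section)
            continue
        # too long: split on paragraph boundaries, never mid-sentence
        # (a single paragraph longer than max_tokens is kept whole)
        buf = []
        for p in section.split('\n\n'):
            # flush before adding a paragraph that would push the buffer over the limit
            if buf and token_count('\n\n'.join(buf + [p])) > max_tokens:
                chunks.append('\n\n'.join(buf))
                buf = []
            buf.append(p)
        if buf:
            tail = '\n\n'.join(buf)
            if token_count(tail) < min_tokens and chunks:
                # short tail: merge into the previous chunk instead of dropping it
                chunks[-1] = chunks[-1] + '\n\n' + tail
            else:
                chunks.append(tail)
    return chunks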

Two rules made the biggest difference. Never split inside a sentence, and never split a table away from its header row. Both sound obvious in retrospect. Both required actually opening the bad chunks from the first version and looking at them, which is the unglamorous middle step that most RAG tutorials skip.
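The table rule is easy to state and easy to get wrong, so here is a minimal sketch of one way to honour it. It assumes the table has already been parsed into a list of row strings with the header first, and it splits by row count rather than token count to keep the example short. `split_table` and `max_rows` are hypothetical names, not part of the pipeline described here.

```python
def split_table(rows, max_rows=20):
    """Split a table into pieces of at most max_rows body rows, repeating the header in each."""
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    if not body:
        return [[header]]            # header-only table: keep it as a single piece
    return [[header] + body[i:i + max_rows] for i in range(0, len(body), max_rows)]


table = [
    "Client | Notice period | Refund window",
    "Acme   | 30 days       | 10 days",
    "Birch  | 60 days       | 14 days",
    "Cedar  | 30 days       | 7 days",
]
for piece in split_table(table, max_rows=2):
    print("\n".join(piece), end="\n\n")
```

Every piece carries its own header row, so a chunk containing only the Cedar row still says what each column means.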

The embeddings choice and the cost math

I used a 768-dimensional sentence-transformer model running locally rather than a hosted embedding API. Two reasons: cost predictability, and the fact that I could re-embed the entire corpus on a laptop in minutes if I needed to change the model. With a hosted API and a few thousand documents, re-embedding becomes a budget conversation, and budget conversations slow down iteration.

The cost math that mattered: at our document volume, embedding once cost roughly the same as a single round of testing the QA flow with a small group. So I stopped re-embedding the whole corpus on every change and started caching by content hash. New documents get embedded. Unchanged documents are skipped. A document whose text changed gets re-embedded and its old chunks are deleted from the store in the same transaction.
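Caching by content hash can be sketched in a few lines. This is an assumption-heavy illustration rather than the actual cache: `embed_with_cache` is a hypothetical helper, `embed_fn` stands for whatever function turns a list of strings into a list of vectors, and `cache` is any dict-like store keyed by SHA-256 digest.

```python
import hashlib


def content_hash(text):
    """Stable key for a piece of text: the SHA-256 hex digest of its UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(texts, embed_fn, cache):
    """Embed only texts whose hash is not cached yet; return vectors in input order."""
    missing = {}
    for t in texts:
        h = content_hash(t)
        if h not in cache and h not in missing:
            missing[h] = t
    if missing:
        # One batched call for everything new; embed_fn is supplied by the caller.
        vectors = embed_fn(list(missing.values()))
        for h, v in zip(missing, vectors):
            cache[h] = v
    return [cache[content_hash(t)] for t in texts]
```

Because the key is derived from the text alone, renaming or moving a file does not trigger a re-embed, while editing a single character does.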

Retrieval that an operator can audit

Retrieval is top-k with maximal marginal relevance for diversity. Nothing exotic. The interesting decision was to expose the retrieved chunks in the UI, not hide them. When the agent answers, the user sees, in a sidebar, exactly which chunks fed the answer, with the document name, the section heading, and a short excerpt.

This was uncomfortable to ship. It looks less magical. It also turned out to be the single biggest reason the team trusted the tool. When an answer was wrong, they could click through, see why, and tell me whether the bug was in retrieval (the right chunk was not pulled) or in generation (the right chunk was pulled and the model still said the wrong thing). I cannot overstate how much faster that made debugging.
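For readers who have not met maximal marginal relevance, here is a minimal pure-Python sketch of the selection step. It assumes cosine similarity and a trade-off weight `lam` (1.0 means pure relevance, lower values favour diversity). The default of 0.7 and the function names are illustrative assumptions, not values taken from this system.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query_vec, cand_vecs, k=5, lam=0.7):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    relevance = [cosine(query_vec, c) for c in cand_vecs]
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalise similarity to the closest chunk that was already selected.
            redundancy = max((cosine(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate candidates and one different but still relevant candidate, a lower `lam` picks one of the duplicates and then the different chunk, instead of spending both slots on the same paragraph.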

Inline citations and out-of-context detection

Every claim in an answer is followed by a citation id that links back to the exact chunk it came from. I enforce this by giving the model a whitelist of citation ids in the prompt (the ids of the retrieved chunks) and post-processing the answer to reject any id outside the whitelist.
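The post-processing check is small. A minimal sketch, assuming citations appear in the answer as bracketed ids such as `[c3]`; that format, the regex, and the name `check_citations` are assumptions made for illustration.

```python
import re

# Hypothetical citation format: a bracketed id such as [c3].
CITATION = re.compile(r"\[(c\d+)\]")


def check_citations(answer, allowed_ids):
    """Return (cited, unknown): ids cited in order of appearance, and cited ids outside the whitelist."""
    cited = CITATION.findall(answer)
    unknown = sorted(set(cited) - set(allowed_ids))
    return cited, unknown


answer = "Notice is 30 days [c1]. Refunds take 10 days [c7]."
cited, unknown = check_citations(answer, allowed_ids={"c1", "c2"})
if unknown:
    print("rejecting answer, unknown citation ids:", unknown)
```

What happens on rejection (retry, strip the claim, or surface an error) is a product decision the sketch leaves to the caller.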

I also built a small out-of-context detector. After the model produces an answer, I run a cheap second call that takes the answer and the retrieved chunks and asks: 'are there claims in this answer that are not supported by any chunk?'. If the answer is yes, the UI flags the unsupported claims in amber. It is not perfect. It is good enough to keep the team from over-trusting the agent on questions where retrieval came back thin.
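The shape of that second call can be sketched without committing to a particular client library. In the sketch below, `ask_model` is an assumed callable that takes a prompt string and returns the model's text reply, and the prompt wording and JSON reply format are illustrative choices rather than the ones used in this system.

```python
import json


def build_support_prompt(answer, chunks):
    """chunks maps citation id -> chunk text; the prompt asks for unsupported claims as JSON."""
    numbered = "\n\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    return (
        "Chunks:\n" + numbered + "\n\nAnswer:\n" + answer + "\n\n"
        "List every claim in the answer that no chunk supports. "
        "Reply with a JSON array of strings only; reply [] if every claim is supported."
    )


def unsupported_claims(answer, chunks, ask_model):
    """Return the list of unsupported claims, or None if the reply cannot be parsed."""
    reply = ask_model(build_support_prompt(answer, chunks))
    try:
        claims = json.loads(reply)
    except json.JSONDecodeError:
        return None                  # unparseable verdict: the caller decides what to show
    if not isinstance(claims, list):
        return None
    return [str(c) for c in claims]
```

Returning `None` for an unparseable reply keeps "the checker failed" distinct from "the checker found nothing", which matters when the UI decides whether to show an amber flag.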

Session memory without poisoning

Conversational memory is the part most teams get subtly wrong. The easy implementation, append every prior turn to the context, is a slow leak: stale facts from yesterday's conversation start influencing today's answer about a different account.

What I do instead: store the conversation in the vector store with a session id, and retrieve from it only when the new query refers to something it does not itself specify ('what about the other one', 'the same client as last week'). For self-contained queries, the agent does not see any prior turns at all. The improvement in answer quality on day-two queries was noticeable enough that one of the consultants told me, unprompted, that the agent had 'stopped getting confused'.
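How to decide that a query leans on earlier turns is a design choice in its own right. As one possible approach, here is a minimal sketch that gates session recall on a short list of cue phrases. The cue list, `needs_history`, and `build_context` are hypothetical, and a real system might use a classifier or a cheap model call instead.

```python
import re

# Hypothetical cue phrases that suggest the query depends on earlier turns.
CONTEXT_CUES = re.compile(
    r"\b(the other one|the same|that one|as before|last week|last time|what about|earlier)\b",
    re.IGNORECASE,
)


def needs_history(query):
    """True if the query contains a phrase that usually points back at a prior turn."""
    return bool(CONTEXT_CUES.search(query))


def build_context(query, doc_chunks, recall_session):
    """recall_session(query) returns prior-turn snippets; it is only called when needed."""
    history = recall_session(query) if needs_history(query) else []
    return {"documents": doc_chunks, "history": history}
```

The property worth keeping, whatever the detector looks like, is that a self-contained query never triggers the session lookup at all.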

Incremental ingestion

Documents arrive every day. A naive pipeline would re-ingest the whole folder nightly and quietly cost real money. My pipeline watches the documents directory, hashes each file, and processes only what changed. New files get chunked and embedded. Modified files get their old chunks deleted before re-embedding. Deleted files get their chunks pruned.

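seen = set()
for path in watched_dir.rglob('*'):
    if not path.is_file():
        continue
    key = str(path)
    seen.add(key)
    h = sha256_file(path)
    if manifest.get(key) == h:
        continue                                   # unchanged
    # new or modified: drop any existing chunks first (also clears leftovers
    # from a run that crashed before the manifest was saved)
    store.delete(where={'source_path': key})
    chunks = semantic_chunks(read_text(path))
    if chunks:                                     # nothing to add for an empty document
        store.add(
            ids=[f'{key}:{h}:{i}' for i in range(len(chunks))],   # one unique id per chunk
            documents=chunks,
            metadatas=[{'source_path': key, 'hash': h} for _ in chunks],
        )
    manifest[key] = h
for key in set(manifest) - seen:                   # deleted files: prune their chunks
    store.delete(where={'source_path': key})
    del manifest[key]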

A small thing that mattered: the manifest is written atomically at the end of each ingestion run. If the process is killed halfway, the next run will reprocess the in-progress files rather than skipping them on a stale manifest. Crash safety is cheap if you design for it on day one and painful to retrofit later.
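An atomic manifest write usually means writing to a temporary file in the same directory and renaming it over the old one. A minimal sketch, assuming the manifest is a plain dict stored as JSON; the function name and file format are assumptions.

```python
import json
import os
import tempfile


def write_manifest_atomic(manifest, path):
    """Write the manifest as JSON via a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())     # make sure the bytes are on disk before the rename
        os.replace(tmp, path)        # atomic when tmp and path are on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)           # do not leave a half-written temp file behind
        raise
```

A reader of `path` sees either the old manifest or the new one, never a partial file, which is what makes "killed halfway" a safe state.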

What a Tuesday morning actually looks like

The thing I am proudest of is not technical. It is the fact that on a normal Tuesday morning, a consultant who has never read a line of Python opens the Streamlit page, types 'what did we agree with the Geneva client about cancellation terms last quarter', and gets a two-paragraph answer with three citations she can click through, in about six seconds. The technical work above only matters because that interaction works without ceremony.

Retrieval is a product problem before it is a model problem. A weaker model with better chunks and honest citations will beat a stronger model with worse retrieval every time, and it will keep beating it as the corpus grows.

Tags: RAG, Claude API, ChromaDB, Embeddings, Python