15 January 2026·14 min

Five Agents in a Trench Coat: What I Learned Building a Production Multi-Agent Orchestrator

A walk through the autonomous multi-agent system I shipped on top of Claude, where a researcher, analyst, writer, QA, and output agent collaborate via typed JSON contracts to turn one prompt into a full briefing pack in under ninety seconds.

Closeup of a dark server rack with a row of glowing green and amber status LEDs and blue fibre cables — Five agents, five lights, one orchestrator deciding who runs next.

For a long time my default move with a hard task was to write a longer prompt. More context, more constraints, a few examples, maybe a chain-of-thought nudge. It works until it doesn't, and the place it stops working is roughly the moment a task has more than one real step in it.

The task that broke this for me was a stakeholder briefing pack at work. The shape of it: take a single sentence from an operations manager, go and find the relevant information across the web and our internal PDFs, pull out the numbers that matter, write a two-page brief in our house voice, fact-check every claim, and hand back a clean document. One prompt to one model produced something that looked impressive and was, on inspection, wrong in small ways everywhere.

So I stopped trying to make one agent do all of it and built five.

Why five agents and not one bigger prompt

The five roles are researcher, analyst, writer, QA, and output. The split is not arbitrary. Each role has a different success criterion, and each one fails in a different way. A researcher that hallucinates is failing differently from a writer that hallucinates, and the fix is different. Once I started thinking about it that way, separating them stopped feeling like over-engineering and started feeling like the only honest answer.

The hardest decision was making QA its own agent rather than a clause inside the writer's system prompt. I tried the clause version first. The writer was simply too invested in its own output to fail it, the way a developer who just wrote a function is the worst person to review it. A separate QA agent, given only the final draft and the source material, with no memory of how the draft was produced, catches things the writer will never catch on itself.

Typed JSON contracts between agents

Agents talk to each other in JSON. Not free text, not markdown, not 'whatever the model felt like producing this time'. Every message has a schema, validated with pydantic before it is allowed to move down the pipeline.

from pydantic import BaseModel, Field
from typing import Literal

class ResearchFinding(BaseModel):
    claim: str = Field(min_length=1, max_length=500)
    source_url: str
    source_excerpt: str
    confidence: Literal['high', 'medium', 'low']

class ResearcherOutput(BaseModel):
    topic: str
    findings: list[ResearchFinding]
    gaps: list[str]   # what the researcher could not answer

The gaps field matters more than it looks. Early versions of the researcher would silently invent confident answers for things it could not find. Forcing it to explicitly list what it could not answer, and rewarding it in the prompt for being honest about gaps, cut hallucinated findings by something like an order of magnitude. I did not measure it cleanly enough to put a real number on it, but the QA agent's rejection rate roughly halved overnight.

The router and the retry logic

Between every two agents sits a small router. Its job is unglamorous: validate the upstream output against the schema, retry once with the validation error fed back into the prompt if it fails, and if it fails a second time, route to a fallback path rather than crashing the run.

def call_agent(agent, payload, schema, max_retries=2):
    last_error = None
    for attempt in range(max_retries + 1):
        raw = agent.run(payload, last_error=last_error)
        try:
            return schema.model_validate_json(raw)
        except ValidationError as e:
            last_error = str(e)
    return fallback_for(agent, payload, last_error)

The fallback is usually a degraded version of the same step. If the analyst cannot produce a clean structured analysis after two tries, the fallback hands the writer the raw research findings with a flag that says, in effect, 'the analysis layer skipped, write conservatively and flag uncertainty inline'. The pipeline degrades. It does not silently fail.

ChromaDB as persistent session memory

The orchestrator keeps a ChromaDB vector store per stakeholder. Every research finding, every approved brief, every QA comment that resulted in a revision, goes in. On the next run for the same stakeholder, the researcher gets a head start: it queries the store with the new topic and pulls back anything previously verified, so it does not re-verify things we already know.

I was careful about what gets re-embedded. Source documents are embedded once and cached by content hash. Agent outputs are embedded only after they pass QA, never on the way in. That single rule, only memorise things that survived review, kept the store from poisoning future runs with rejected drafts.

Claude tool use for the dirty work

The researcher is the only agent allowed to touch the outside world. It does so through Claude's tool use, with four tools wired in: a web search, a PDF reader, a structured data extractor, and a citation tracker. Everything else, the analyst, writer, QA, output agent, operates only on the JSON it is handed.

tools = [
    {'name': 'web_search', 'description': 'Search the public web. Returns top 5 results with snippets.', 'input_schema': {...}},
    {'name': 'read_pdf', 'description': 'Fetch a PDF by URL or path and return its text, chunked by section.', 'input_schema': {...}},
    {'name': 'extract_table', 'description': 'Extract a structured table from a chunk of text. Returns JSON rows.', 'input_schema': {...}},
    {'name': 'cite', 'description': 'Register a citation. Returns a citation id to reference in findings.', 'input_schema': {...}},
]

resp = client.messages.create(
    model='claude-sonnet-4',
    max_tokens=4096,
    tools=tools,
    messages=conversation,
)

Constraining the researcher to register every citation through the cite tool was one of the most useful guardrails I added. The tool persists the URL, the excerpt, and a timestamp, and returns an id. Any finding that references a citation id that was never registered fails validation. Hallucinated citations stop being a class of bug.

Failure modes that taught me the most

Three failure modes were responsible for almost every bad run in the first month.

Silent JSON drift. The model returns something that is almost valid JSON but with, say, a stray trailing comma or a property the schema does not have. Solved by strict pydantic validation and a one-shot retry with the validation error fed back as context.

Runaway tool loops. The researcher keeps calling web_search with slight variations of the same query, racking up cost and time. Solved by capping tool calls per agent per run, and by giving the researcher an explicit 'give up and report a gap' option in its system prompt.

Citation hallucination. The writer cites a finding the researcher never produced. Solved by passing the writer a whitelist of valid citation ids and rejecting any draft that references an id outside it.

The Streamlit dashboard

The operator-facing surface is a Streamlit app that streams agent traces in real time. Every JSON message between agents is shown as a collapsible card, colour-coded by agent. A run that takes seventy seconds end to end is shown as it happens, so the operator can see exactly where time is going and which agent is on the critical path.

I did not expect this to matter as much as it did. The dashboard is the reason non-technical colleagues trust the system. Watching the researcher pull a real URL, then watching the QA agent reject a draft and the writer revise it, makes the pipeline feel legible. A black box that produces the same brief in the same time would not have been adopted.

What I would change next

Two things are on my list. First, a planner agent sitting above the five, deciding which subset of agents a given task actually needs. Not every brief needs a full research pass, and the planner could skip straight to writing for follow-up questions on an already-researched topic.

Second, ChromaDB has been good enough but I keep getting bitten by pure-vector retrieval missing exact-keyword matches. The next iteration will be a hybrid: BM25 for keyword precision, vectors for semantic recall, reranked together. I expect the bigger win to be in the reranker, not the index.

A multi-agent system is just a way of admitting, in code, that no single prompt is responsible enough to be trusted with the whole job. Once you accept that, the rest is plumbing.

Claude APIMulti-AgentPythonChromaDBTool Use