Code-review bot

Production reference architecture for a pull-request review bot fronting ~24 analysis tools. Demonstrates how the context firewall carries the weight of a code-review workflow: large diff and grep outputs go straight to the artifact store while the prompt stays compact.

TL;DR

What	Where
The script	`examples/architectures/code_review_bot/main.py`
The catalog	`examples/architectures/code_review_bot/catalog.yaml`
Captured output	`examples/architectures/code_review_bot/OUTPUT.md`
Local README	`examples/architectures/code_review_bot/README.md`

Run it:

python examples/architectures/code_review_bot/main.py

(Or make architectures / make example.)

The shape

The bot walks a six-step review of a regression-introducing refactor in payments/charge.py:

"show me the diff of this pull request against main" — git.diff (large output → firewall)
"grep for the symbol legacy_charge in the codebase" — grep.symbol (large output → firewall)
"run the test suite for the changed module" — test.run_module
"run mypy on the changed module to surface type errors" — typecheck.module
"run ruff on the changed files and report style violations" — lint.run
"post a review comment requesting changes on the regression" — review.post_comment

For each step:

The Router narrows 24 tools to a top-3 shortlist (top_k=3).
The bot picks one tool from the shortlist using an explicit intent map. That separation is the load-bearing pattern: contextweaver bounds the choice, the bot (or, in production, an LLM) makes the final selection.
The tool is "called" against a mocked backend. Large outputs go through the firewall: the ~28 KB diff dump and ~2.5 KB grep result become ~500-char summaries in the prompt, while the raw bytes are parked in the artifact store.
Persistent facts (pr.target_file, pr.test_status, pr.type_errors) are written via ContextManager.add_fact_sync so they survive into the answer-phase prompt for every subsequent step.
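
The route-then-pick separation described above can be sketched without contextweaver at all. The snippet below is a minimal illustration under stated assumptions: `shortlist` is a hypothetical keyword-overlap scorer standing in for `Router.route`, and the four-tool `CATALOG` and the `INTENT_MAP` contents are invented for the example. None of these names are contextweaver's API.

```python
from typing import Dict, List

# Hypothetical mini-catalog: tool id -> description (the real catalog has ~24 tools).
CATALOG: Dict[str, str] = {
    "git.diff": "show the diff of a pull request against a base branch",
    "grep.symbol": "grep for a symbol in the codebase",
    "test.run_module": "run the test suite for a module",
    "lint.run": "run ruff on changed files and report style violations",
}

# Hypothetical explicit intent map: keyword in the query -> tool the bot prefers.
INTENT_MAP: Dict[str, str] = {
    "diff": "git.diff",
    "grep": "grep.symbol",
    "test": "test.run_module",
    "ruff": "lint.run",
}


def shortlist(query: str, top_k: int = 3) -> List[str]:
    """Stand-in for the router: rank tools by word overlap, keep top_k."""
    words = set(query.lower().split())
    ranked = sorted(
        CATALOG,
        key=lambda tool: len(words & set(CATALOG[tool].split())),
        reverse=True,
    )
    return ranked[:top_k]


def pick(query: str, candidates: List[str]) -> str:
    """Bot-side selection: only ever return a tool that is in the shortlist."""
    lowered = query.lower()
    for keyword, tool in INTENT_MAP.items():
        if keyword in lowered and tool in candidates:
            return tool
    return candidates[0]  # fall back to the top-ranked candidate


query = "grep for the symbol legacy_charge in the codebase"
candidates = shortlist(query)      # bounded choice: at most three tools
chosen = pick(query, candidates)   # final selection stays inside the bound
```

The point of the split is that `pick` can be swapped for an LLM call without widening what it is allowed to choose from: whatever makes the final decision only ever sees the shortlist.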

What's load-bearing

contextweaver feature	Used	What it does here
`Router.route(query)`	✅	Narrows 24 tools → top-3 shortlist (`top_k=3`)
Bounded choice pattern	✅	Bot picks from the shortlist, not from the whole catalog
`TreeBuilder` DAG	✅	One-shot graph build at startup; routes are sub-millisecond
Context firewall	✅✅	Compacts the ~28 KB diff dump and ~2.5 KB grep result down to ~500-char summaries before they touch the prompt
Artifact store	✅	Raw bytes stay addressable for drilldown; only summaries land in the prompt
Persistent facts	✅	Three fact keys survive across all six review steps
Tight per-phase budgets	✅	`ContextBudget(route=1500, call=2500, interpret=2500, answer=3500)` keeps the answer prompt small even after the firewall externalises the heavy bytes
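
To make the budget row concrete, here is a rough back-of-the-envelope check. It is only a sketch: `PhaseBudget` is a hypothetical dataclass mirroring the four numbers above (it is not contextweaver's `ContextBudget`), the budgets are assumed to be token counts, and the 4-characters-per-token ratio is a crude assumption rather than a measured value.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # assumption: rough average, not a contextweaver constant


@dataclass(frozen=True)
class PhaseBudget:
    """Hypothetical mirror of the four per-phase limits quoted in the table."""
    route: int = 1500
    call: int = 2500
    interpret: int = 2500
    answer: int = 3500


def estimated_tokens(num_chars: int) -> int:
    """Crude token estimate from a character count (ceiling division)."""
    return -(-num_chars // CHARS_PER_TOKEN)


budget = PhaseBudget()
raw_diff_tokens = estimated_tokens(28 * 1024)  # ~28 KB diff inlined verbatim
summary_tokens = estimated_tokens(500)         # ~500-char firewall summary

fits_inlined = raw_diff_tokens <= budget.answer      # False under these assumptions
fits_summarised = summary_tokens <= budget.answer    # True under these assumptions
```

Under these assumptions the inlined diff alone is roughly twice the answer budget, while the summary uses a small fraction of it.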

Why this architecture matters

Code-review bots are firewall-bound. Most steps can produce output that would saturate a model's context window if inlined: a typical PR diff is 10–50 KB, a grep for a renamed symbol returns dozens of hits, and lint and typecheck pipelines emit hundreds of lines on a hot patch. Without a firewall, the prompt blows the budget by step 3 and the bot starts truncating mid-review.

contextweaver's Context Engine handles this without a per-tool integration: the firewall fires on any result exceeding firewall_threshold (2 KB default), parks the raw bytes in the artifact store, injects a compact summary into the prompt, and leaves an artifact handle so the LLM can request a drilldown when needed.
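
The threshold-park-summarise behaviour can be illustrated with a few lines of standard-library Python. This is a minimal sketch, not contextweaver's implementation: the `firewall` function, the dict-backed `artifact_store`, the `artifact:` handle format, and the head-truncation "summary" are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Optional, Tuple

FIREWALL_THRESHOLD = 2 * 1024  # bytes; mirrors the 2 KB default quoted above
SUMMARY_CHARS = 500            # summary size used in this example

artifact_store: Dict[str, bytes] = {}  # hypothetical in-memory store


def firewall(tool_output: str) -> Tuple[str, Optional[str]]:
    """Return (text for the prompt, artifact handle or None)."""
    raw = tool_output.encode("utf-8")
    if len(raw) <= FIREWALL_THRESHOLD:
        return tool_output, None  # small results pass through untouched
    # Content-derived handle (made-up format) so the raw bytes stay addressable.
    handle = "artifact:" + hashlib.sha256(raw).hexdigest()[:12]
    artifact_store[handle] = raw
    # Naive summary: keep only the head of the output.
    return tool_output[:SUMMARY_CHARS], handle


small_text, small_handle = firewall("3 passed, 0 failed")

big_output = "+ def charge(amount):\n" * 1300  # roughly 28 KB of fake diff
prompt_text, handle = firewall(big_output)
```

Here the short test result reaches the prompt unchanged, while the fake diff is cut to 500 characters and its full bytes remain retrievable from `artifact_store[handle]`.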

What's intentionally not here

Real git integration. Mock tool responses keep the example deterministic and CI-friendly. A real deployment would wire git.diff to subprocess.run(["git", "diff", "main"], ...) or to an MCP server.
LLM-based diff summarisation. The review.summarize_diff tool's canned response is a stand-in for what a Summarizer plugin (issue #26) would do.
Multi-PR session state. The fact store survives the in-process review, but to persist across PRs you would swap the default InMemoryFactStore for a SqliteFactStore (issue #174) or a Mem0 / Zep adapter (issue #195).
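
For the first item, wiring `git.diff` to a real repository could look roughly like the sketch below. The `git_diff` wrapper, its parameters, and the `fake_run` test double are hypothetical names introduced for illustration; only `subprocess.run` and the `git diff main` invocation come from the text above.

```python
import subprocess
from typing import Callable, List


def git_diff(
    base: str = "main",
    repo_dir: str = ".",
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Return the diff of the working tree in repo_dir against `base`."""
    cmd: List[str] = ["git", "diff", base]
    # check=True raises CalledProcessError if git exits non-zero.
    result = run(cmd, cwd=repo_dir, capture_output=True, text=True, check=True)
    return result.stdout


def fake_run(cmd, **kwargs):
    """Test double so the sketch runs without a git checkout."""
    return subprocess.CompletedProcess(
        cmd, 0, stdout="diff --git a/payments/charge.py b/payments/charge.py\n", stderr=""
    )


# Deterministic call for CI; drop the `run` argument to shell out to real git.
diff_text = git_diff("main", run=fake_run)
```

Keeping the runner injectable preserves the deterministic, CI-friendly behaviour of the mocked example while letting a deployment use the real `subprocess.run`.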