Benchmarks

contextweaver ships a public, reproducible benchmark scorecard so you can see top-k recall, token savings, and routing latency before integrating.

TL;DR

What	Where
The adopter-facing interpretation	`docs/benchmark_report.md` — cost, prompt-size, latency, and failure-mode framing
The numbers	`benchmarks/scorecard.md` — committed, regeneratable in one command
The raw output	`benchmarks/results/latest.json` — machine-readable
The harness	`benchmarks/benchmark.py`
The methodology	`benchmarks/README.md`

Reproduce locally:

make benchmark   # writes benchmarks/results/latest.json
make scorecard   # writes benchmarks/scorecard.md from latest.json
git diff --quiet benchmarks/scorecard.md   # passes on clean re-run with same seed

The check git diff --quiet benchmarks/scorecard.md is the determinism gate: identical inputs must produce byte-identical scorecard output.

Scaling and trend

Beyond the headline scorecard, three companion benchmarks track behaviour at scale and over time:

What	Command	Output
Latency to 10k tools + cache speedup	`make benchmark-routing-scale`	`scaling matrix`
300+ tool recall, namespace/deny filters, and firewall view	`make benchmark-large-catalog`	`benchmarks/large_catalog_scorecard.md`
ChoiceCards, ambiguity/destructive filtering, and large-result exposure	`make benchmark-scenario`	`benchmarks/scenario_routing.md`
Release-over-release trend	`make trend`	`benchmarks/trend.md`

PR-time regressions are caught by the gating check (benchmarks/gating.yaml + scripts/benchmark_gate.py): a PR that drops a gated quality metric beyond its tolerance band fails CI unless the benchmark-accepted label is applied with a rationale. See the scaling matrix page for the full methodology.

What is measured

Routing. Precision\@k, recall\@k, MRR, and p50/p95/p99 latency at catalog sizes 50 / 83 / 1000 against a fixed gold dataset of 200 hand-curated queries covering all 8 catalog namespaces (benchmarks/routing_gold.json). The 1000-item catalog is the natural 83-item pool extended with synthetic variants so beam search has to work harder; the gold queries only match the non-synthetic items, so accuracy numbers stay valid as catalog size scales.

Context pipeline. For each scenario in benchmarks/scenarios/:

event_count, items_included, items_dropped (budget pressure), dedup_removed (near-duplicate removal)
prompt_tokens and budget_utilization_pct against the answer-phase budget (default 6000 tokens)
artifacts_created (firewall interceptions) and avg_compaction_ratio (raw artifact size ÷ injected summary size)

The current scenarios:

Scenario	Turns	Purpose
`short_conversation.jsonl`	5 user turns	Baseline; low pressure
`long_conversation.jsonl`	20 user turns	Mid pressure with one large invoice search result
`large_catalog.jsonl`	15 user turns	Routing breadth across all 8 namespaces
`stress_conversation.jsonl`	50+ user turns	Stresses budget-driven drop + dedup + multi-result compaction (#181)

How to read these numbers

The scorecard reports four distinct kinds of measurement; each one means something different and degrades under different conditions:

Metric	What it measures	What's stable	What isn't
`recall@k`, `precision@k`, `MRR`	Whether the default lexical scorer (`tfidf` baseline, `bm25` optional) puts the gold-set tool inside the top-`k` shortlist	Reproducible across machines for a given seed and gold set	A floor, not a ceiling — the scorer is pluggable via `Router(scorer_backend=...)`; embedding-based and LLM-rerank backends are out of scope for this baseline scorecard
`p50` / `p95` / `p99` latency (ms)	Wall-clock cost of one `Router.route(query)` call	Relative ordering (50 < 83 < 1000 always)	Absolute microseconds vary with hardware, Python build, and other load on the runner; the docs explicitly call this out
`prompt_tokens`, `budget_utilization_pct`	Output of the Context Engine for a fixed scenario log	Reproducible across machines (`CharDivFourEstimator`, no `tiktoken` state dependency)	The scenarios are illustrative — your event log will produce different numbers
`avg_compaction_ratio`, `artifacts_created`	Firewall behaviour: how much raw tool output is intercepted	Reproducible	Driven by the scenarios' tool outputs, not by a property of contextweaver — if your real tool calls return tiny payloads, compaction is `1.0×`

Two practical reading rules:

Treat recall@5 at catalog_size 1000 as a baseline number, not a ceiling. The default tfidf and bm25 scorers are lexical; they trade recall for transparency and zero-network determinism. A retriever-first shortlist (embedding ANN or LLM rerank) is the documented way to recover recall at scale — tracked under issue #8.
Token-reduction percentages are scenario-bound. "41.6 %-84.3 %" is the range across the six committed scenarios, not a guarantee for any specific workload. The right number for your agent is whatever you measure on your own event logs using scripts/baseline_naive.py.

Known limits and honest framing

The numbers above describe what the default contextweaver configuration does on the default benchmark fixtures. The following are deliberate, documented gaps — not bugs:

Routing recall degrades with catalog size. At catalog_size 50, recall@5 is ~0.56. At 1000, it falls to ~0.15 with both tfidf and bm25. This is the well-known lexical-retrieval ceiling on large tool catalogs. contextweaver's response is to make the scorer pluggable (see Router(scorer_backend=...) and the Retriever protocol on the EngineRegistry), not to claim the default scorer is sufficient at scale. The scorecard exists so you can see this drop yourself, swap in a stronger backend, and re-measure.
Token reduction is scenario-driven. All percentages above are measured against the six committed scenarios under benchmarks/scenarios/. If your real conversations or tool outputs differ materially in shape (much shorter turns, much larger results, very different namespace distribution), expect different percentages.
Latency is informational. Treat the latency numbers as ordering between catalog sizes, not as a service-level objective for production deployments.
No claim is made about end-to-end agent cost, answer quality, or accuracy — those depend on the model and the agent loop, not on contextweaver alone.

If you find a regime where the default backends underperform, file an issue with your gold set and catalog — the harness is designed to absorb new scenarios.

Gateway scenarios (range)

The MCP Context Gateway reference architecture (examples/architectures/mcp_context_gateway/) ships a single-turn reference run that exercises the launch narrative end-to-end. As of issue #270 the harness also runs five deterministic gateway scenarios over the same 60-tool catalog so the reduction number is reported as a measured range, not a single anecdote:

Scenario	Raw chars	Summary chars	Firewall reduction	Artifact
`tiny_ack`	34	34	0.0 %	—
`small_post`	77	77	0.0 %	—
`medium_ticket`	238	238	0.0 %	—
`large_log`	14,206	501	96.5 %	✓
`bigquery_rowset`	16,507	194	98.8 %	✓

Headline range: firewall reduction across five gateway scenarios runs 0.0 % – 98.8 %. The 0.0 % cases are not regressions — they're the firewall correctly no-op'ing on payloads under the firewall_threshold=2000 threshold. The full per-scenario detail (routing query, chosen tool, exposed cards, answer-phase tokens) is committed at benchmarks/gateway_scorecard.md.

The single-turn architecture run still produces the same metrics (bigquery_rowset row above, plus hydrated_schema_chars = 854 and final_prompt_tokens = 142 from the answer-phase build); see examples/architectures/mcp_context_gateway/OUTPUT.md for the captured run.

Reading this honestly: the 98.8 % high-water mark only applies to the bigquery_rowset scenario. A real gateway deployment's per-call reduction will vary with the shape of each upstream response; the explicit floor at 0.0 % is the load-bearing claim that the firewall does not penalise small responses.

Reproduce byte-for-byte via:

python examples/architectures/mcp_context_gateway/main.py   # single-turn run
make benchmark-gateway && make gateway-scorecard            # 5-scenario range
make gateway-scorecard-check                                # CI: byte-stable

Why this exists

Pre-scorecard, the README compared "70% lower cost" and "sub-second latency" as illustrative arithmetic — useful as intuition, but not tied to any specific catalog, dataset, or configuration. The scorecard fills that gap: public numbers that anyone can regenerate from a clean checkout, that CI re-runs as an informational step on every PR, and that downstream library authors can use as a reference for what "good" looks like.

Scope notes

This first cut of the scorecard ships:

the default scorer backend (tfidf)
catalog sizes 50 / 83 / 1000 as defined by benchmarks/benchmark.py
four scenarios under benchmarks/scenarios/

Tracked as follow-ups (see issue #197):

Per-backend matrix comparing tfidf vs bm25 vs fuzzy
100 / 500 / 1000 catalog sizes (rename + scenario tune to match)
Weekly scheduled CI run that regenerates the scorecard and opens a drift PR
Hardware reference table — pin a specific runner spec for absolute-latency numbers