Adopter Benchmark Report
A decision-oriented companion to the generated benchmark scorecard. The scorecard is the source of truth for raw numbers; this page explains what those numbers mean for an engineering lead evaluating contextweaver.
Executive Summary
contextweaver helps most when an existing agent already works, but its prompt is being crowded by tool schemas, tool results, or long tool-use history.
| Best fit | Why contextweaver helps |
|---|---|
| MCP / FastMCP / Python agents with many tools | The Routing Engine shows the model a bounded ChoiceCard shortlist instead of the whole tool catalog. |
| Tools that return large JSON, logs, tables, CSV, or binary payloads | The context firewall stores raw output as artifacts and injects compact summaries. |
| Long multi-turn agents with tool calls/results | Phase-specific context builds keep dependency chains intact while dropping lower-value context first. |
| Poor fit | Why it may not help |
|---|---|
| A small agent with 3-5 tiny tools and short conversations | There may be no prompt-budget pressure to relieve. |
| A team looking for an agent framework or model SDK | contextweaver does not run the loop or call the LLM. |
| A pure RAG or long-term-memory problem | Use a retrieval or memory system first; contextweaver can budget their outputs later. |
| A very large catalog with sparse metadata and only lexical scoring | Default recall degrades at large catalog sizes; use a stronger retriever or improve catalog metadata. |
Prompt-Size and Cost Model
The committed scorecard compares contextweaver against a naive baseline that
concatenates all tool schemas plus full conversation history. The token counts
below come from benchmarks/results/latest.json and are rendered in
benchmarks/scorecard.md.
Cost math is intentionally model-agnostic:
cost_per_1k_turns = tokens_per_turn * 1000 * input_price_per_1m_tokens / 1_000_000
Use your own provider's input-token price. The worked example below uses an
illustrative input_price_per_1m_tokens = $1.00.
| Scenario | Naive tokens | contextweaver tokens | Reduction | Cost / 1k turns naive | Cost / 1k turns cw |
|---|---|---|---|---|---|
large_catalog |
2,767 | 1,514 | 45.28 % | $2.77 | $1.51 |
long_conversation |
4,365 | 2,548 | 41.63 % | $4.37 | $2.55 |
mixed_payload |
2,277 | 497 | 78.17 % | $2.28 | $0.50 |
short_conversation |
1,946 | 496 | 74.51 % | $1.95 | $0.50 |
stress_conversation |
17,482 | 6,651 | 61.96 % | $17.48 | $6.65 |
tiny_payload |
1,704 | 267 | 84.33 % | $1.70 | $0.27 |
Average across the six committed scenarios: 5,090 naive tokens vs 1,996
contextweaver tokens per answer-phase build. At the illustrative price above,
that is about $5.09 vs $2.00 per 1,000 similarly-shaped turns.
Important: this is a prompt-input estimate, not an end-to-end production cost guarantee. Completion tokens, model latency, retries, tool latency, cache hits, and business logic all live outside this offline harness.
Latency Model
Latency numbers in the scorecard are hardware-dependent. Treat them as ordering and scale signals, not as service-level objectives.
| Catalog size | Queries | Recall@5 | MRR | p50 route latency | p95 route latency | p99 route latency |
|---|---|---|---|---|---|---|
| 50 | 131 | 0.5649 | 0.4978 | 0.423 ms | 0.548 ms | 0.759 ms |
| 83 | 200 | 0.3825 | 0.3242 | 0.667 ms | 0.779 ms | 1.134 ms |
| 1000 | 200 | 0.1475 | 0.1456 | 36.007 ms | 37.467 ms | 41.711 ms |
The context-build side is measured as token output and budget behavior rather
than wall-clock latency in the public scorecard. For real deployments, measure
your own event-log shape with OpenTelemetry or a wrapper around
ContextManager.build_sync(...).
Routing-Quality Model
The default routing backends are deterministic and transparent. They are also not magic: lexical scoring degrades as catalog size and noise increase.
| Observation | What it means |
|---|---|
| Recall@5 is 0.5649 at 50 tools and 0.1475 at 1000 tools in the headline run. | Routing-only adoption is strongest for small-to-medium catalogs or catalogs with strong metadata. |
The scorecard includes tfidf, bm25, embedding_hashing, and skipped optional backends. |
Backends are swappable; compare them on your catalog rather than assuming the default is best. |
| The mixed-shape 500-tool catalog reports lower recall than the uniform catalog. | Real catalogs with head-heavy namespaces and long tails need better metadata or a retriever-first shortlist. |
Recommended mitigation path for large catalogs:
- Improve
SelectableItem.description,tags, andnamespacequality. - Compare
tfidf,bm25, andembedding_hashingwithmake benchmark-matrix. - Add a retriever-first shortlist through the
Retrieverprotocol when the catalog is too large or too semantically sparse for lexical ranking alone.
Failure Modes Avoided
| Failure mode | What contextweaver does |
|---|---|
| Giant tool output pasted raw into the answer prompt | Stores the raw payload as an ArtifactRef and injects a summary. |
| Full schema catalog injected on every route turn | Exposes compact ChoiceCards and hydrates full schema only when needed. |
| Manual history trimming drops a parent tool call | Dependency closure keeps selected tool results paired with their parent calls. |
| Confidential context crosses a policy boundary | Sensitivity filtering drops or redacts according to ContextPolicy. |
| Debugging prompt changes becomes guesswork | BuildStats, route traces, and explanations show what was included, dropped, and why. |
Reproduce
make benchmark-matrix
make scorecard
make scorecard-check
Source files:
| File | Role |
|---|---|
benchmarks/results/latest.json |
Machine-readable benchmark output. |
benchmarks/scorecard.md |
Generated scorecard and methodology. |
benchmarks/benchmark.py |
Harness implementation. |
benchmarks/scenarios/ |
Context-pipeline fixtures. |
benchmarks/routing_gold.json |
Routing gold set. |
Deterministic metrics: recall, drops, dedup counts, token counts, artifact counts, and compaction ratios.
Hardware-dependent metrics: p50/p95/p99 latency.
How to Decide
Adopt a small proof-of-concept if at least one of these is true:
- Your prompt regularly includes more tools than the model should reason about.
- A single tool can return enough data to dominate the prompt.
- Your agent history includes tool-call chains that manual truncation breaks.
- You need deterministic context budgeting and inspectable drop reasons.
Defer contextweaver if none of those are true yet. Keep the scorecard commands around, then re-run them once the agent grows into the problem.