contextweaver Troubleshooting Guide

Quick reference for common integration problems, debugging techniques, performance optimisation, and frequently asked questions.


1. Overview

Three tools cover the majority of debugging scenarios:

  • BuildStats — inspect pack.stats after every build_sync() / build() call to see exactly what was kept, dropped, deduplicated, and why.
  • Store inspection — query event_log, artifact_store, fact_store, and episodic_store directly to see what data the engine is working with.
  • Router debug trace — pass debug=True to Router.route() to record the beam-search path taken for each query.

If a problem is framework-specific, check the integration guides in docs/: MCP · A2A · OpenTelemetry GenAI.


2. Common Issues & Solutions

Issue 1: Token Budget Too Tight

Symptom:

context build: phase=answer, included=1, dropped=19, tokens=350/6000

pack.stats.included_count is much lower than expected; most items are dropped.

Cause: Phase budget exhausted before all relevant items could be packed.

Solution:

from contextweaver.config import ContextBudget
from contextweaver.context.manager import ContextManager
from contextweaver.types import Phase

# Defaults: route=2000, call=3000, interpret=4000, answer=6000
# Increase any phase that is too tight for your model / use-case
budget = ContextBudget(route=3000, call=5000, interpret=6000, answer=8000)
mgr = ContextManager(budget=budget)

Inspect pack.stats.dropped_reasons to confirm the cause:

pack = mgr.build_sync(phase=Phase.answer, query="...")
print(pack.stats.dropped_reasons)  # e.g. {"budget": 15, "sensitivity": 2}
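To see why a tight budget produces many "budget" drops, here is a minimal stand-alone sketch of greedy packing under a token ceiling. It is not contextweaver's implementation: the `pack_greedy` helper, the item tuples, and the token counts are all illustrative assumptions.

```python
from collections import Counter


def pack_greedy(items, budget):
    """Pack (item_id, score, tokens) tuples in descending score order.

    Hypothetical helper: an item is kept only if it still fits in the
    remaining budget; otherwise it is counted under the "budget" reason.
    """
    included, dropped_reasons = [], Counter()
    remaining = budget
    for item_id, _score, tokens in sorted(items, key=lambda it: it[1], reverse=True):
        if tokens <= remaining:
            included.append(item_id)
            remaining -= tokens
        else:
            dropped_reasons["budget"] += 1
    return included, dict(dropped_reasons)


# Made-up items: (id, relevance score, estimated tokens)
items = [("a", 0.9, 300), ("b", 0.8, 250), ("c", 0.7, 200), ("d", 0.6, 100)]

tight = pack_greedy(items, budget=400)
roomy = pack_greedy(items, budget=1000)
print(tight)  # (['a', 'd'], {'budget': 2})
print(roomy)  # (['a', 'b', 'c', 'd'], {})
```

With the larger ceiling nothing is dropped, which is the effect raising a phase budget is expected to have when "budget" dominates dropped_reasons.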

Issue 2: Context Firewall Intercepted My Tool Result

Symptom:

firewall: intercepted item_id=tr1, summary_len=312

The LLM receives a truncated summary instead of the full tool output.

Cause: The firewall unconditionally intercepts every tool_result item. Raw output is stored out-of-band in ArtifactStore; the LLM only sees a compact summary. This is by design — large outputs would otherwise consume the entire token budget.

Solution — access the raw artifact:

artifact_bytes = mgr.artifact_store.get("artifact:tr1")
full_result = artifact_bytes.decode("utf-8")

Solution — plug in a custom summarizer:

from contextweaver.context.manager import ContextManager
from contextweaver.protocols import Summarizer

class MyDomainSummarizer(Summarizer):
    def summarize(self, text: str, metadata: dict) -> str:
        # Return a richer summary tailored to your domain
        return text[:1000]  # example: keep first 1000 chars

mgr = ContextManager(summarizer=MyDomainSummarizer())

Why the firewall exists: Raw tool results can be megabytes. Storing them out-of-band and injecting summaries keeps prompts deterministic and budget-bounded.
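The pattern can be pictured with a few lines of plain Python. This is a sketch under stated assumptions, not the library's code: `apply_firewall`, the dict-based store, and the prefix-truncation summary are hypothetical stand-ins; only the `artifact:<item_id>` handle format is taken from this guide.

```python
def apply_firewall(item_id, raw_text, store, max_chars=200):
    """Store raw output out-of-band and return (handle, summary).

    Hypothetical stand-in: the summary is a plain prefix truncation.
    """
    handle = f"artifact:{item_id}"
    store[handle] = raw_text.encode("utf-8")  # full payload kept outside the prompt
    if len(raw_text) <= max_chars:
        summary = raw_text
    else:
        summary = raw_text[:max_chars] + "..."
    return handle, summary


store = {}
raw = "row," * 5000  # 20,000 characters of tool output
handle, summary = apply_firewall("tr1", raw, store)

print(handle)              # artifact:tr1
print(len(summary))        # 203 (200 characters plus the "..." marker)
print(len(store[handle]))  # 20000 bytes, still retrievable in full
```

The prompt cost is bounded by the summary length, while the full payload stays available through the handle.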


Issue 3: Router Didn't Pick the Expected Tool

Symptom:

Query: "send an email"
Expected: send_email
Actual: send_sms, create_ticket

Cause: TF-IDF scoring favoured other tools; the expected tool's description may not contain the keywords that appear in the query.

Solution A — improve the tool description:

from contextweaver.types import SelectableItem

SelectableItem(
    id="send_email",
    name="send_email",
    description="Send an email message to a recipient address",  # "email" keyword
)
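The effect of a missing keyword can be reproduced with a toy TF-IDF ranker. This is a minimal sketch, assuming a smoothed IDF and raw term counts; the `rank` helper and the "before" descriptions are made up for illustration, and the router's real scoring may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, descriptions):
    """Rank tool ids by a simple smoothed TF-IDF score against the query."""
    docs = {tool: Counter(tokenize(desc)) for tool, desc in descriptions.items()}
    n_docs = len(docs)
    scores = {}
    for tool, counts in docs.items():
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for c in docs.values() if term in c)
            idf = math.log((n_docs + 1) / (df + 1)) + 1.0
            score += counts[term] * idf  # Counter returns 0 for absent terms
        scores[tool] = score
    return sorted(scores, key=scores.get, reverse=True)


before = {
    "send_email": "Dispatch a message to a recipient",  # no query keywords
    "send_sms": "Send an SMS text message",
    "create_ticket": "Create a support ticket",
}
after = dict(before, send_email="Send an email message to a recipient address")

ranking_before = rank("send an email", before)
ranking_after = rank("send an email", after)
print(ranking_before[0])  # send_sms
print(ranking_after[0])   # send_email
```

Only the description changed between the two runs; the query and the other tools are identical.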

Solution B — widen the beam and increase top-k:

router = Router(graph, items=catalog.all(), beam_width=5, top_k=10)
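Why a wider beam helps can be seen in a toy level-by-level beam search. This is an illustrative sketch only: `beam_search`, the tree, and the scores are hypothetical, and the Router's actual traversal may differ.

```python
def beam_search(tree, scores, root, beam_width):
    """Level-by-level beam search; returns reachable leaf ids ranked by score.

    `tree` maps a node to its children; nodes without an entry are leaves.
    """
    frontier, leaves = [root], []
    while frontier:
        children = [c for node in frontier for c in tree.get(node, [])]
        leaves += [c for c in children if c not in tree]
        inner = [c for c in children if c in tree]
        # Keep only the best `beam_width` internal nodes for the next level.
        frontier = sorted(inner, key=scores.get, reverse=True)[:beam_width]
    return sorted(leaves, key=scores.get, reverse=True)


tree = {
    "root": ["support", "messaging"],
    "support": ["create_ticket"],
    "messaging": ["send_email", "send_sms"],
}
# Made-up relevance scores for the query "send an email"
scores = {
    "support": 0.6, "messaging": 0.5,
    "create_ticket": 0.5, "send_email": 0.9, "send_sms": 0.4,
}

narrow = beam_search(tree, scores, "root", beam_width=1)
wide = beam_search(tree, scores, "root", beam_width=2)
print(narrow)  # ['create_ticket']
print(wide)    # ['send_email', 'create_ticket', 'send_sms']
```

With a beam of 1 the "messaging" branch is pruned before send_email is ever scored; a beam of 2 keeps it alive.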

Solution C — inspect the debug trace:

result = router.route("send an email", debug=True)
for step in result.debug_trace:
    print(step)
# Shows each beam expansion, node scores, and why items were (de-)prioritised

Solution D — render an explanation of the decision surface (RouteResult.explanation()):

RouteResult.explanation() (issue #226) produces a paste-friendly Markdown rationale of why the router ranked the candidates the way it did — top-k table, confidence gap, ambiguity flag, applied context hints, filter counts. Useful in GitHub issues, Slack threads, and PR descriptions when reporting unexpected routing behaviour:

result = router.route("send an email")
print(result.explanation())                # default Markdown form
payload = result.explanation(format="dict")  # versioned structured payload

Sample Markdown output:

### Routing explanation for query `send an email`

_Retriever engine: `tfidf`._

**Top candidates**

| Rank | Tool id | Score |
|---:|:---|---:|
| 1 | `comms.email.send` | 0.8421 |
| 2 | `comms.sms.send` | 0.6105 |
| 3 | `crm.tickets.create` | 0.4988 |

**Confidence gap**: `comms.email.send` (0.8421) vs runner-up `comms.sms.send` (0.6105) = **+0.2316**.

✅ Result is **not** ambiguous (gap above threshold).

The dict form (format="dict") returns a versioned schema ({"version": 1, ...}) that is safe for programmatic consumers (observability spans, automated test assertions). Privacy: the explanation surfaces item ids, scores, and the original query — it never includes args_schema content or full item descriptions.


Issue 4: BuildStats Shows All Items Dropped

Symptom:

pack.stats.total_candidates  # e.g. 20
pack.stats.included_count    # 0
pack.stats.dropped_count     # 20

Cause: Two distinct failure modes — check total_candidates first:

  • total_candidates == 0 — items were excluded before the pipeline started. Phase-kind filtering happens in generate_candidates() (stage 1): items whose ItemKind is not in policy.allowed_kinds_per_phase[phase] are never added as candidates and never appear in dropped_reasons.
  • total_candidates > 0 and dropped_count == total_candidates — items entered the pipeline but were ejected. Check dropped_reasons for the cause.

Valid keys in dropped_reasons:

| Key | Meaning |
|:---|:---|
| "budget" | Item doesn't fit in the remaining token budget |
| "kind_limit" | max_items_per_kind cap reached for this ItemKind |
| "sensitivity" | Dropped by sensitivity policy |

Diagnosis:

pack = mgr.build_sync(phase=Phase.answer, query="...")
print(pack.stats.total_candidates)  # 0 → items never generated; >0 → items dropped
print(pack.stats.dropped_reasons)
# {"budget": 18, "sensitivity": 2}  → budget is the main cause
# {"kind_limit": 20}               → max_items_per_kind cap reached

# If total_candidates == 0, check the phase-kind policy:
from contextweaver.config import ContextPolicy
from contextweaver.types import Phase, ItemKind

policy = ContextPolicy()
print(policy.allowed_kinds_per_phase[Phase.answer])
# Items whose kind is NOT in this list are silently excluded before scoring
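The two failure modes can also be told apart mechanically. The helper below is a hypothetical sketch that takes the three counters as plain values (for example, copied from pack.stats); it is not part of contextweaver.

```python
def diagnose(total_candidates, dropped_count, dropped_reasons):
    """Classify a build from its counters (hypothetical helper)."""
    if total_candidates == 0:
        return "no candidates generated: check ingestion and the phase-kind policy"
    if dropped_count == total_candidates:
        if dropped_reasons:
            main = max(dropped_reasons, key=dropped_reasons.get)
        else:
            main = "unknown"
        return f"all candidates dropped: main reason is {main!r}"
    return "some candidates were included"


print(diagnose(0, 0, {}))
print(diagnose(20, 20, {"budget": 18, "sensitivity": 2}))
print(diagnose(20, 5, {"budget": 5}))
```

The first branch points at ingestion or the phase-kind policy; the second points at the dominant key in dropped_reasons.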

Solution — increase budget or adjust phase policy:

from contextweaver.config import ContextBudget, ContextPolicy
from contextweaver.context.manager import ContextManager
from contextweaver.types import ItemKind, Phase

budget = ContextBudget(answer=12000)

policy = ContextPolicy()
# Ensure tool_result items are permitted in the interpret phase
policy.allowed_kinds_per_phase[Phase.interpret].append(ItemKind.tool_result)

mgr = ContextManager(budget=budget, policy=policy)

Issue 5: Deduplication Removed Important Context

Symptom:

pack.stats.dedup_removed  # e.g. 8

Important items that should be distinct are treated as duplicates.

Cause: Default Jaccard similarity threshold is 0.85. Items with ≥ 85 % token overlap are collapsed.

Fix: Configure dedup_threshold on ScoringConfig:

from contextweaver.config import ScoringConfig
from contextweaver.context.manager import ContextManager

# More conservative: only collapse near-exact duplicates
scoring = ScoringConfig(dedup_threshold=0.95)
mgr = ContextManager(scoring_config=scoring)

# Effectively disable deduplication
scoring = ScoringConfig(dedup_threshold=1.0)
mgr = ContextManager(scoring_config=scoring)

See docs/architecture.md for algorithm details.
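For intuition about the threshold, here is a minimal Jaccard computation over whitespace tokens. It is a sketch only: the `jaccard` helper and the tokenisation are assumptions, and the library's tokeniser may split text differently.

```python
def jaccard(a, b):
    """Jaccard similarity of two texts over lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


a = "deploy failed on staging because the database migration timed out"
b = "deploy failed on staging because the database migration timed out again"

sim = jaccard(a, b)
print(round(sim, 3))  # 0.909 (10 shared tokens out of 11 distinct tokens)
print(sim >= 0.85)    # True: collapsed at the default threshold
print(sim >= 0.95)    # False: kept distinct at dedup_threshold=0.95
```

Two events that differ by a single word can therefore be merged at 0.85 but survive at 0.95.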


Issue 6: async build() Hangs

Symptom:

# Coroutine never completes
pack = await mgr.build(phase=Phase.answer, query="...")

Cause: A blocking call (e.g., slow summarizer, blocking I/O) inside a hook or summarizer can stall the event loop.

Solution:

# Option A: Use build_sync() when you're not in an async context
pack = mgr.build_sync(phase=Phase.answer, query="...")

# Option B: In an async context, offload the blocking synchronous build
# to a worker thread so the event loop stays responsive
import asyncio
from functools import partial

async def my_async_step():
    loop = asyncio.get_running_loop()
    build = partial(mgr.build_sync, phase=Phase.answer, query="query")
    return await loop.run_in_executor(None, build)
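The following stand-alone sketch shows the difference offloading makes. `blocking_build` is a hypothetical stand-in for a synchronous build with a slow summarizer; no contextweaver code is involved. It uses asyncio.to_thread(), which requires Python 3.9 or later.

```python
import asyncio
import time


def blocking_build():
    """Stand-in for a synchronous build that blocks for 200 ms."""
    time.sleep(0.2)
    return "pack"


async def heartbeat(stop):
    """Count how often the event loop gets to run while the build is in flight."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def main():
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(stop))
    # The blocking call runs in a worker thread; the loop keeps scheduling tasks.
    result = await asyncio.to_thread(blocking_build)
    stop.set()
    return result, await beat


result, ticks = asyncio.run(main())
print(result, ticks)
```

If `blocking_build()` were called directly inside `main()`, the heartbeat task would not get a single tick until the build returned.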

Issue 7: Events Not Appearing in Context (Candidates = 0)

Symptom:

pack.stats.total_candidates  # 0

Cause: Items were never ingested, or their ItemKind is not allowed for the current phase.

Solution:

# Confirm items are in the event log
print(mgr.event_log.count())  # Should be > 0

# Confirm the phase allows the item kind you ingested
from contextweaver.types import Phase, ItemKind
from contextweaver.config import ContextPolicy

policy = ContextPolicy()
allowed = policy.allowed_kinds_per_phase[Phase.route]
print(allowed)  # e.g. [user_turn, plan_state, policy]
# If your item kind is not listed, add it or use a different phase
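As a rough picture of how phase-kind filtering can empty the candidate list, consider this plain-Python sketch. The `generate_candidates` function here is a hypothetical re-implementation, and the allowed kinds are the example values shown above; the real defaults may differ.

```python
# Example policy taken from the sample output above; treat it as illustrative.
ALLOWED_KINDS = {"route": ["user_turn", "plan_state", "policy"]}


def generate_candidates(events, phase, allowed=ALLOWED_KINDS):
    """Keep only events whose kind is permitted for the phase (hypothetical)."""
    return [e for e in events if e["kind"] in allowed.get(phase, [])]


events = [
    {"id": "tr1", "kind": "tool_result"},
    {"id": "tr2", "kind": "tool_result"},
]

candidates = generate_candidates(events, "route")
print(len(events), len(candidates))  # 2 0

events.append({"id": "u1", "kind": "user_turn"})
candidates_after = generate_candidates(events, "route")
print([e["id"] for e in candidates_after])  # ['u1']
```

The log is non-empty in both cases, yet the first build sees zero candidates because no ingested kind is allowed in that phase.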

Issue 8: Token Count Mismatch with External Framework

Symptom: The total estimated tokens in pack.stats are much lower or higher than the token count reported by LlamaIndex, LangChain, or another framework.

Cause: contextweaver uses a CharDivFour estimator by default (1 token ≈ 4 characters). External frameworks often use tiktoken or a model-specific tokeniser.

Solution — compute totals from pack.stats and, if needed, use TiktokenEstimator:

from contextweaver.context.manager import ContextManager
from contextweaver.protocols import TiktokenEstimator

# Compute the total estimated tokens from a build
total_estimated_tokens = (
    sum(pack.stats.tokens_per_section.values())
    + pack.stats.header_footer_tokens
)

# Plug in the built-in tiktoken-backed estimator (requires `tiktoken` package)
mgr = ContextManager(token_estimator=TiktokenEstimator(model="gpt-4"))
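A characters-divided-by-four estimate depends only on string length, which is why it can drift from a real tokeniser. The sketch below assumes floor division; `chars_div_four` is a hypothetical re-implementation, and the library's rounding may differ.

```python
def chars_div_four(text):
    """Estimate tokens as len(text) // 4 (rounding rule is an assumption)."""
    return len(text) // 4


english = "The quick brown fox jumps over the lazy dog."  # 44 characters
braces = "{}" * 22                                        # also 44 characters

print(chars_div_four(english))  # 11
print(chars_div_four(braces))   # 11
```

Both strings get the same estimate even though a model-specific tokeniser would generally not split prose and punctuation-heavy text into the same number of tokens.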

Offline / Air-gapped Tiktoken Warning

Symptom:

tiktoken cl100k_base encoding unavailable (...); falling back to chars/4 token estimate

Cause: tiktoken is installed, but it downloads encoding data on first use. Sandboxes, corporate networks, and CI containers that block openaipublic.blob.core.windows.net cannot fetch cl100k_base on demand. contextweaver then falls back to CharDivFourEstimator, which preserves deterministic budget enforcement but is less exact than the real encoding.

Solution — pre-warm a cache on a connected machine, then copy it:

PowerShell:

$env:TIKTOKEN_CACHE_DIR = "C:\path\to\tiktoken-cache"
python -c "import tiktoken; tiktoken.get_encoding('cl100k_base')"

Bash:

export TIKTOKEN_CACHE_DIR=/path/to/tiktoken-cache
python -c "import tiktoken; tiktoken.get_encoding('cl100k_base')"

Copy that cache directory into the offline environment and set TIKTOKEN_CACHE_DIR to the copied path before running contextweaver.

If exact tiktoken parity is not required, no action is needed. The committed scorecard's headline context metrics intentionally use CharDivFourEstimator for network-independent reproducibility; the separate token-estimator parity section reports or skips cl100k_base drift depending on cache availability.

CI regression check: after serialising a session with contextweaver ingest, pin a ceiling with budget-check so prompt-size regressions fail before they reach production:

contextweaver budget-check \
  --session session.json \
  --phase answer \
  --query "current user task" \
  --max-tokens 4000 \
  --breakdown

The command exits 0 when the rendered prompt is within the ceiling and exits 1 when it is over. Add --json for CI parsers, or --ratchet to write the default .budget-baseline.json baseline and fail future runs that grow beyond it. Use --ratchet-path path/to/baseline.json when CI needs a different file.
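If your CI is driven from Python rather than a shell script, the same check can be wrapped as follows. This is a sketch, not an official helper: the function names are hypothetical, and only the flags and the 0/1 exit codes described above are used.

```python
import subprocess


def budget_check_cmd(session, phase, query, max_tokens, breakdown=False, as_json=False):
    """Build the argv list for `contextweaver budget-check` (hypothetical wrapper)."""
    cmd = [
        "contextweaver", "budget-check",
        "--session", session,
        "--phase", phase,
        "--query", query,
        "--max-tokens", str(max_tokens),
    ]
    if breakdown:
        cmd.append("--breakdown")
    if as_json:
        cmd.append("--json")
    return cmd


def within_budget(cmd):
    """Run the command: exit 0 means within the ceiling, exit 1 means over."""
    code = subprocess.run(cmd, check=False).returncode
    if code not in (0, 1):
        raise RuntimeError(f"budget-check failed unexpectedly (exit code {code})")
    return code == 0


cmd = budget_check_cmd("session.json", "answer", "current user task", 4000, breakdown=True)
print(" ".join(cmd))
```

Calling `within_budget(cmd)` requires the contextweaver CLI on PATH; any exit code other than 0 or 1 is treated as a tooling failure rather than a budget verdict.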


Issue 9: Graph Build Fails (GraphBuildError)

Symptom:

contextweaver.exceptions.GraphBuildError: cycle detected ...

or

contextweaver.exceptions.GraphBuildError: empty catalog

Cause:

  • Empty catalog passed to TreeBuilder.
  • A manually constructed ChoiceGraph introduced a cycle via add_edge().

Solution:

# Ensure catalog is non-empty before building
assert len(catalog.all()) > 0, "Catalog must not be empty"
graph = TreeBuilder(max_children=20).build(catalog.all())

# If you're building the graph manually, edges that form cycles raise
# GraphBuildError immediately — re-check your parent/child assignments.

Issue 10: High Latency in Real-Time Agent

Symptom: Context build takes 200–500 ms, causing perceptible lag.

Cause: Large event logs, aggressive deduplication (O(n²) comparisons), or a slow custom summarizer.

Solution — profile with BuildStats, then tune:

import time

start = time.perf_counter()
pack = mgr.build_sync(phase=Phase.answer, query="...")
elapsed = (time.perf_counter() - start) * 1000

print(f"Build time: {elapsed:.1f} ms")
print(f"Candidates processed: {pack.stats.total_candidates}")
print(f"Dedup removed: {pack.stats.dedup_removed}")

Optimisation checklist:

  • Use tighter phase budgets to reduce how much content is included in the final pack; this does not reduce how many candidates are processed or scored.
  • In async runtimes, offload build_sync() to a worker thread with asyncio.to_thread() or loop.run_in_executor() if you need to avoid blocking the event loop; await mgr.build() alone still runs the synchronous pipeline.
  • Use the default CharDivFour estimator (faster than tiktoken).
  • Keep the event log shallow: archive old turns to episodic_store and remove them from the active log.


3. Debugging Techniques

Inspect BuildStats

pack = mgr.build_sync(phase=Phase.answer, query="user query")

print(f"Total candidates:   {pack.stats.total_candidates}")
print(f"Included:           {pack.stats.included_count}")
print(f"Dropped:            {pack.stats.dropped_count}")
print(f"Dropped reasons:    {pack.stats.dropped_reasons}")
# e.g. {"budget": 12, "sensitivity": 3}  (valid keys: "budget", "kind_limit", "sensitivity")

print(f"Dedup removed:      {pack.stats.dedup_removed}")
print(f"Dependency closures: {pack.stats.dependency_closures}")
print(f"Token usage:        {sum(pack.stats.tokens_per_section.values())}")
print(f"Tokens per section: {pack.stats.tokens_per_section}")

Inspect the Event Log

from contextweaver.types import ItemKind

# All events
for event in mgr.event_log.all():
    print(f"{event.id} ({event.kind.value}): {event.text[:60]}…")

# Filter by kind
tool_results = mgr.event_log.filter_by_kind(ItemKind.tool_result)
print(f"Tool results in log: {len(tool_results)}")

Inspect Artifacts (Firewall Interceptions)

# List all stored artifacts
for ref in mgr.artifact_store.list_refs():
    print(f"  {ref.handle}  label={ref.label}")

# Retrieve full raw content for a specific artifact
artifact_bytes = mgr.artifact_store.get("artifact:tr1")
print(artifact_bytes.decode("utf-8"))

Inspect Routing Decisions

# Enable debug trace (records each beam-search expansion)
result = router.route("send a reminder email", debug=True)

print("Candidates:", result.candidate_ids)
print("Scores:    ", result.scores)

for step in result.debug_trace:
    print(step)

Inspect Facts and Episodes

# Facts stored by the summarization / extraction pipeline
for key in mgr.fact_store.list_keys():
    for fact in mgr.fact_store.get_by_key(key):
        print(f"{key}: {fact}")

# Episodic summaries
for episode in mgr.episodic_store.all():
    print(f"{episode.episode_id}: {episode.summary[:80]}…")

Enable Debug Logging

import logging

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("contextweaver.context").setLevel(logging.DEBUG)
logging.getLogger("contextweaver.routing").setLevel(logging.DEBUG)

This traces candidate counts, firewall interceptions, scoring, deduplication, and beam-search expansions at every pipeline stage.


4. Performance Optimisation

Latency Sources

| Stage | Cost | Notes |
|:---|:---|:---|
| generate_candidates | O(n) | Scales with event log size |
| dependency_closure | O(n) | Usually fast |
| apply_firewall | O(n) + summarizer | Summarizer cost is caller-controlled |
| score_candidates | O(n) | TF-IDF index built once |
| deduplicate_candidates | O(n²) | Main hotspot for large candidate pools |
| select_and_pack | O(n log n) | Typically fast |
| render_context | O(n) | Fast string assembly |
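To put the O(n²) entry for deduplicate_candidates in perspective, the sketch below counts comparisons under the assumption of a naive all-pairs check; the library may prune or short-circuit some of these.

```python
from itertools import combinations


def pairwise_comparisons(n):
    """Number of unordered pairs among n candidates: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Sanity check against an explicit enumeration for a small pool.
assert pairwise_comparisons(10) == len(list(combinations(range(10), 2)))

small = pairwise_comparisons(100)
large = pairwise_comparisons(1000)
print(small)  # 4950
print(large)  # 499500
```

A tenfold increase in candidates costs roughly a hundredfold more comparisons, which is why a short event log matters more than any other knob for this stage.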

For Low-Latency Agents (real-time, voice)

from contextweaver.config import ContextBudget
from contextweaver.context.manager import ContextManager

# Tighter budgets reduce how much context is retained in the final pack.
# To reduce candidates processed earlier in the pipeline, keep the event log
# short and rely on phase-kind / TTL / sensitivity filtering.
budget = ContextBudget(route=500, call=800, interpret=800, answer=1500)
mgr = ContextManager(budget=budget)

# In async runtimes, offload to a thread to avoid blocking the event loop:
# pack = await asyncio.to_thread(mgr.build_sync, phase=Phase.answer, query="...")

# Keep the event log short — archive old turns to episodic_store

For Accuracy-Focused Agents (LlamaIndex, LangChain)

# Larger budgets → more context preserved
budget = ContextBudget(route=3000, call=5000, interpret=6000, answer=10000)
mgr = ContextManager(budget=budget)

# Use the built-in tiktoken-backed estimator matching your LLM
from contextweaver.protocols import TiktokenEstimator

mgr = ContextManager(budget=budget, token_estimator=TiktokenEstimator(model="gpt-4"))

For Large Tool Catalogs (100+ tools)

# Route first, then build context only for the shortlisted tools
result = router.route(user_query, top_k=10)
shortlisted_ids = set(result.candidate_ids)

# Optionally filter ingested tool results to shortlisted tools only
relevant_events = [
    e for e in mgr.event_log.all()
    if e.parent_id in shortlisted_ids or e.id in shortlisted_ids
]

5. FAQ

Q: What are the default token budgets?

A: ContextBudget(route=2000, call=3000, interpret=4000, answer=6000). Tune them based on pack.stats and your model's context window.


Q: Does the context firewall only fire for large tool results?

A: No. The firewall intercepts every tool_result item, regardless of size. Raw content is stored in ArtifactStore; the LLM always sees a compact summary. Access raw data via mgr.artifact_store.get("artifact:<item_id>").


Q: How do I debug what was kept or dropped?

A: Inspect pack.stats after every build:

pack = mgr.build_sync(phase=Phase.answer, query="...")
print(pack.stats.included_count, pack.stats.dropped_count)
print(pack.stats.dropped_reasons)   # breakdown by cause
print(pack.stats.dedup_removed)     # near-duplicates removed

Q: Does this work with [framework X]?

A: contextweaver is framework-agnostic — it compiles context and you send the prompt to any LLM or framework. See the integration guides for MCP and A2A. LlamaIndex, LangChain/LangGraph, OpenAI Agents SDK, and Google ADK guides are in progress.


Q: What's the default deduplication threshold?

A: 0.85 Jaccard similarity. Items with ≥ 85 % token overlap are treated as near-duplicates and collapsed. The higher-scoring item is retained.


Q: Can I persist context across sessions?

A: Yes. Use fact_store and episodic_store to persist memory across turns. Serialise the event log to JSONL with contextweaver ingest / replay CLI commands.


Q: Does contextweaver work with open-source LLMs?

A: Yes. contextweaver is LLM-agnostic. It compiles context; you send pack.prompt to any model.


Q: What's the typical latency overhead?

A: 10–50 ms for a context build with a moderate event log (< 100 events). The main hotspot is Jaccard-based deduplication (O(n²)). Use tighter budgets and shorter event logs for real-time agents.


Q: Can I disable the context firewall?

A: The firewall applies to all tool_result items. To suppress its effect, provide a pass-through summarizer that returns the full text, and ensure your token budget is large enough to accommodate uncompressed results:

from contextweaver.context.manager import ContextManager
from contextweaver.protocols import Summarizer

class PassThroughSummarizer(Summarizer):
    def summarize(self, text: str, metadata: dict) -> str:
        return text  # no truncation

mgr = ContextManager(summarizer=PassThroughSummarizer())

Note: raw content is still stored in ArtifactStore (this cannot be disabled), but the LLM prompt will now contain the full text.


Q: Can I use contextweaver without the routing engine?

A: Yes. ContextManager is fully independent. The routing engine (Catalog, TreeBuilder, Router) is optional and only needed when you want to shortlist tools from a large catalog.


Q: Can I customize the scoring function?

A: Custom scoring is not yet configurable (v0.1). You can influence scoring through ScoringConfig weights (recency_weight, tag_match_weight, kind_priority_weight, token_cost_penalty):

from contextweaver.config import ScoringConfig
from contextweaver.context.manager import ContextManager

scoring = ScoringConfig(recency_weight=0.5, tag_match_weight=0.1)
mgr = ContextManager(scoring_config=scoring)

Q: How do I contribute?

A: See CONTRIBUTING.md for setup instructions, the development workflow, and review guidelines.