Context Engineering: From RAG to Reliable Answers
Auspexi
TL;DR: We add a context layer on top of RAG: hybrid retrieval (BM25 + dense + reranker), context signals (retrieval_margin, support_docs, recency, trust, format), token budget packing, and evidence provenance. The signals feed the Risk Guard, which fetches more context, asks a clarifying question, or abstains before generation.
Why context wins (a short story)
Two identical models answer the same question. One sees a clean, relevant brief with citations; the other sees a noisy dump. The first answers correctly and concisely; the second hallucinates. Same model, different context. Our job is to engineer the brief.
Hybrid Retrieval
- Combine BM25 + dense embeddings and an optional cross‑encoder reranker
- MMR‑style de‑duplication; boosts for recency and source trust (fusion sketch below)
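Concretely, the fusion step can be as small as the sketch below. This is a minimal illustration, assuming hypothetical per‑document BM25 scores, dense similarities, ages, and trust values; the weights and recency half‑life are placeholders, and the cross‑encoder reranker and MMR pass are only noted in a comment, not implemented.

```python
from math import exp
from typing import Dict, List

def hybrid_rank(
    bm25: Dict[str, float],      # doc_id -> BM25 score (hypothetical input)
    dense: Dict[str, float],     # doc_id -> cosine similarity (hypothetical input)
    age_days: Dict[str, float],  # doc_id -> content age in days
    trust: Dict[str, float],     # doc_id -> source trust in [0, 1]
    w_bm25: float = 0.4,
    w_dense: float = 0.6,
    recency_half_life: float = 90.0,
) -> List[str]:
    """Fuse sparse and dense scores, then boost by recency and source trust."""
    def norm(scores: Dict[str, float]) -> Dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    b, d = norm(bm25), norm(dense)
    fused = {}
    for doc_id in set(b) | set(d):
        base = w_bm25 * b.get(doc_id, 0.0) + w_dense * d.get(doc_id, 0.0)
        recency = exp(-age_days.get(doc_id, 0.0) / recency_half_life)
        fused[doc_id] = base * (0.7 + 0.3 * recency) * (0.5 + 0.5 * trust.get(doc_id, 0.5))
    # An optional cross-encoder reranker and MMR-style de-duplication would run
    # on the top-k of this ranking before budget packing.
    return sorted(fused, key=fused.get, reverse=True)
```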
Signals and Policy
- Signals: retrieval_margin, support_docs, recency_score, source_trust, format_health
- Policy: if signals are thin → fetch more, clarify, or abstain; otherwise generate
Signals map into our pre‑generation Risk Guard. High margin + strong support lowers risk; low margin + weak support triggers fetch or abstain. This prevents wasted tokens and bad answers.
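As an illustration, a pre‑generation policy over these signals could look like the sketch below. The signal names match the list above; the thresholds are invented for the example and are not the Risk Guard's actual tuning.

```python
from dataclasses import dataclass

@dataclass
class ContextSignals:
    retrieval_margin: float  # score gap between top hit and runner-up
    support_docs: int        # independent sources backing the candidate answer
    recency_score: float     # 0..1, decays with content age
    source_trust: float      # 0..1, trust in the included sources
    format_health: float     # 0..1, share of tool outputs that parsed cleanly

def pre_generation_policy(s: ContextSignals) -> str:
    """Illustrative thresholds only; a deployed guard would tune these per use case."""
    if s.format_health < 0.5:
        return "fetch"    # malformed tool output: re-fetch before spending tokens
    if s.retrieval_margin < 0.1 and s.support_docs < 2:
        return "abstain"  # thin evidence on both axes: do not answer
    if s.retrieval_margin < 0.1 or s.support_docs < 2:
        return "clarify"  # ambiguous: ask the user or broaden retrieval
    return "generate"
```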
What good context looks like
- Short and scoped: only spans that directly answer the question
- Citation‑ready: include source IDs and hashable excerpts
- Fresh enough: decay old content unless explicitly requested
- Typed tool outputs: strict JSON or tables; no blob dumps (example span below)
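For concreteness, a citation‑ready, typed span might look like the following. The field names and values are hypothetical, not a published schema.

```python
# A hypothetical, citation-ready context span; fields are illustrative only.
context_span = {
    "source_id": "kb:runbook-042",   # stable ID the answer can cite
    "excerpt": "Rotate the API key every 90 days via the admin console.",
    "excerpt_sha256": "3f6c...",     # hash of the exact quoted text
    "published_at": "2025-01-14",    # drives the recency signal
    "source_trust": 0.9,             # editorially assigned trust score
    "score": 0.82,                   # relevance score from hybrid retrieval
    "tokens": 18,                    # used later by budget packing
}
```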
Budget packing
We score spans, then pack them into a token budget so the model sees the best 2–3 pages of truth instead of 30 pages of noise. The remainder is accessible on demand.
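A minimal greedy packer over scored spans is sketched below, assuming each span carries the hypothetical score and tokens fields from the example above; the real budget and scoring will differ.

```python
from typing import Dict, List

def pack_spans(spans: List[Dict], budget_tokens: int = 3000) -> List[Dict]:
    """Greedy packing: highest-scoring spans first, until the token budget is spent."""
    packed, used = [], 0
    for span in sorted(spans, key=lambda s: s["score"], reverse=True):
        if used + span["tokens"] <= budget_tokens:
            packed.append(span)
            used += span["tokens"]
    return packed  # anything not packed stays retrievable on demand
```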
Evidence
We ship context_provenance.json inside evidence zips with per‑query signals and included sources for audit.
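The shape of one provenance record might look like this; the file name comes from the paragraph above, but the fields shown are an assumed schema for illustration only.

```python
import json

# Hypothetical shape of one record in context_provenance.json; the schema
# shipped in a real evidence zip may differ.
record = {
    "query_id": "q-0017",
    "signals": {
        "retrieval_margin": 0.24,
        "support_docs": 3,
        "recency_score": 0.81,
        "source_trust": 0.9,
        "format_health": 1.0,
    },
    "included_sources": ["kb:runbook-042", "kb:faq-007"],
    "decision": "generate",
}
print(json.dumps(record, indent=2))
```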
Measuring improvement
- Retrieval: P@k, nDCG@k, citation‑hit rate (metric sketch below)
- Answer quality: correctness@k, abstain rate at fixed coverage
- Ops: token cost per accepted answer, latency p95
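For reference, P@k and nDCG@k for a single query can be computed as below; relevance labels are assumed to be available per retrieved document, in ranked order.

```python
import math
from typing import List

def precision_at_k(relevances: List[float], k: int) -> float:
    """P@k: share of the top-k retrieved documents that are relevant."""
    top = relevances[:k]
    return sum(1 for r in top if r > 0) / k if k else 0.0

def ndcg_at_k(relevances: List[float], k: int) -> float:
    """nDCG@k for a single query; relevances are listed in ranked order."""
    def dcg(rels: List[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```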
Get Started