Pre‑generation Hallucination Risk Guard
Idea: hallucinations aren’t random. Estimate risk before answering, bound your hallucination rate with a calibrated threshold, and take safe actions.
How it works
- Compute a risk score from signals (margin, entropy, retrieval; optionally self‑consistency, doc support).
- Pick a target hallucination rate (e.g., 5%) and calibrate a risk_threshold that satisfies it on samples.
- Policy: if risk < threshold → generate; if ≥ threshold → fetch more context; if ≫ threshold → abstain or reroute.
Why pre‑generation risk?
Hallucinations correlate with internal uncertainty and weak support. Estimating risk before generation lets systems avoid costly wrong answers by fetching evidence or abstaining outright, improving precision under fixed latency budgets.
Quick start
// Target hallucination rate
target = 0.05
// Calibrate the threshold on recent labeled samples
threshold = calibrate(samples, target)
// Decide per request, using a risk score computed from the signals below
action = decide(risk, threshold)
Signals and weights
- Margin (higher is safer): distance to decision boundary or calibrated logit gap.
- Entropy (higher is riskier): distributional uncertainty over next tokens or candidates.
- Retrieval (higher is safer): evidence strength from RAG or citations.
- Self‑consistency (optional): agreement across sampled decodes or chains; a sketch combining these signals into one score follows below.
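A minimal Python scoring sketch, assuming each signal has already been normalized to [0, 1]; the weights and the function name are illustrative defaults, not values fixed by this guide.

from typing import Optional

def risk_score(margin: float, entropy: float, retrieval: float,
               consistency: Optional[float] = None) -> float:
    # Combine normalized signals into a single risk estimate in [0, 1].
    # Safer signals (margin, retrieval, consistency) are inverted so that
    # higher output always means higher hallucination risk.
    risk = 0.4 * entropy + 0.3 * (1.0 - margin) + 0.3 * (1.0 - retrieval)
    if consistency is not None:
        # Blend in disagreement across sampled decodes when available.
        risk = 0.8 * risk + 0.2 * (1.0 - consistency)
    return min(max(risk, 0.0), 1.0)

The combination is deliberately linear; in practice the weights would be fit or tuned on labeled samples alongside the threshold calibration described next.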
Calibration procedure
- Collect (features, label) pairs where label indicates a known hallucination or correctness.
- Compute risk for each sample and sort by ascending risk.
- Choose the largest threshold where empirical hallucination rate ≤ target on held‑out data, as sketched below.
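One way to implement that selection rule, assuming samples arrive as (risk, hallucinated) pairs on held‑out data and risk scores lie in [0, 1] as in the scoring sketch above; calibrate_threshold is an illustrative name, not an established API.

from typing import List, Tuple

def calibrate_threshold(samples: List[Tuple[float, bool]], target: float = 0.05) -> float:
    # samples: (risk, hallucinated) pairs computed on held-out data.
    if not samples:
        raise ValueError("calibration needs held-out samples")
    ordered = sorted(samples)               # ascending risk
    accepted, bad, best_prefix = 0, 0, 0
    for _, hallucinated in ordered:
        accepted += 1
        bad += int(hallucinated)
        if bad / accepted <= target:
            best_prefix = accepted          # largest prefix meeting the target so far
    if best_prefix == len(ordered):
        return 1.0                          # every held-out sample can be accepted
    # Under the "risk < threshold" policy this accepts exactly the qualifying prefix.
    return ordered[best_prefix][0]

Taking the largest qualifying prefix matches the "largest threshold" rule above: the running rate may briefly exceed the target and dip back under it, and this keeps the most permissive threshold that still meets it.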
Policy integration
Below the threshold: generate. Near the threshold: fetch more context (expand top‑k, add tools). Above the high‑risk buffer: abstain or reroute to a stronger model. Record decisions and outcomes for audits.
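A sketch of that policy, where the Action names and the high_risk_buffer default are illustrative choices, not part of any published interface.

from enum import Enum

class Action(Enum):
    GENERATE = "generate"              # answer directly
    FETCH_CONTEXT = "fetch_context"    # expand top-k retrieval or add tools, then re-score
    ABSTAIN = "abstain"                # refuse, or reroute to a stronger model

def decide(risk: float, threshold: float, high_risk_buffer: float = 0.15) -> Action:
    # Below the calibrated threshold: safe to generate.
    if risk < threshold:
        return Action.GENERATE
    # Near the threshold: more evidence may bring risk back under it.
    if risk < threshold + high_risk_buffer:
        return Action.FETCH_CONTEXT
    # Well above the threshold: abstain or reroute.
    return Action.ABSTAIN

Logging the (risk, threshold, action) triple for every request feeds the audit trail described next.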
Evidence & audits
Log thresholds, decisions, and outcomes, along with selective‑prediction coverage and SLO status, inside evidence bundles. Keep the record falsifiable and reproducible.
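One possible shape for a single evidence entry; the field names below are assumptions about what a useful bundle might contain, not a schema defined elsewhere.

import json
import time
from typing import Optional

def evidence_record(request_id: str, risk: float, threshold: float, target: float,
                    action: str, outcome: Optional[str] = None) -> str:
    # One JSON line per decision; outcome is filled in later once the answer
    # has been reviewed (e.g., "correct", "hallucination", "abstained").
    return json.dumps({
        "request_id": request_id,
        "timestamp": time.time(),
        "target_hallucination_rate": target,
        "risk": risk,
        "threshold": threshold,
        "action": action,
        "outcome": outcome,
    })

Replaying these records against a candidate threshold shows whether the target rate would still be met, which is what keeps the bundle falsifiable.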
Common pitfalls
- Over‑abstaining: re‑calibrate per segment (see the sketch after this list); ensure coverage targets remain useful.
- Weak labels: validate sample correctness labels before calibration.
- Domain shift: periodically refresh samples; track drift to maintain guarantees.
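A per‑segment recalibration sketch for the over‑abstaining pitfall, assuming the calibrate_threshold helper from the calibration sketch is in scope; how requests are segmented (domain, language, route) is up to the deployment.

from collections import defaultdict
from typing import Dict, List, Tuple

def per_segment_thresholds(samples: List[Tuple[str, float, bool]],
                           target: float = 0.05) -> Dict[str, float]:
    # samples: (segment, risk, hallucinated) triples. One threshold per segment
    # avoids a single global cutoff over-abstaining on harder segments.
    by_segment: Dict[str, List[Tuple[float, bool]]] = defaultdict(list)
    for segment, risk, hallucinated in samples:
        by_segment[segment].append((risk, hallucinated))
    # calibrate_threshold: see the calibration sketch above.
    return {segment: calibrate_threshold(rows, target)
            for segment, rows in by_segment.items()}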
Further reading
- Hassana Labs: Hallucination Risk Calculator & Prompt Re‑engineering Toolkit (OpenAI‑only) — hassana.io/readme.html
- OpenAI: “Why language models hallucinate” (whitepaper) — PDF
Attribution: Inspired by recent work on pre‑generation risk estimation and hallucination controls, including the resources above.
See also: Stability Demo · Whitepaper · FAQ