Pre‑generation Hallucination Risk Guard
Idea: hallucinations aren’t random. Estimate risk before answering, bound your hallucination rate with a calibrated threshold, and take safe actions.
How it works
- Compute a risk score from signals (margin, entropy, retrieval; optionally self‑consistency, doc support).
- Pick a target hallucination rate (e.g., 5%) and calibrate a risk_threshold that satisfies it on samples.
- Policy: if risk < threshold → generate; if ≥ threshold → fetch more context; if ≫ threshold → abstain or reroute.
Why pre‑generation risk?
Hallucinations correlate with internal uncertainty and weak support. Estimating risk before generation lets systems avoid costly wrong answers by fetching evidence or abstaining outright, improving precision under fixed latency budgets.
Quick start
// Target hallucination rate
target = 0.05
// Calibrate the threshold on recent labeled samples
threshold = calibrate(samples, target)
// Decide per request, using a risk score computed from the signals below
action = decide(risk, threshold)
Signals and weights
- Margin (higher is safer): distance to decision boundary or calibrated logit gap.
- Entropy (higher is riskier): distributional uncertainty over next tokens or candidates.
- Retrieval (higher is safer): evidence strength from RAG or citations.
- Self‑consistency (optional): agreement across sampled decodes or chains; a sketch combining these signals into one score follows below.
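A minimal Python scoring sketch, assuming each signal has already been normalized to [0, 1]; the weights and the function name are illustrative defaults, not values fixed by this guide.

from typing import Optional

def risk_score(margin: float, entropy: float, retrieval: float,
               consistency: Optional[float] = None) -> float:
    # Combine normalized signals into a single risk estimate in [0, 1].
    # Safer signals (margin, retrieval, consistency) are inverted so that
    # higher output always means higher hallucination risk.
    risk = 0.4 * entropy + 0.3 * (1.0 - margin) + 0.3 * (1.0 - retrieval)
    if consistency is not None:
        # Blend in disagreement across sampled decodes when available.
        risk = 0.8 * risk + 0.2 * (1.0 - consistency)
    return min(max(risk, 0.0), 1.0)

The combination is deliberately linear; in practice the weights would be fit or tuned on labeled samples alongside the threshold calibration described next.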
Calibration procedure
- Collect (features, label) pairs where label indicates a known hallucination or correctness.
- Compute risk for each sample and sort by ascending risk.
- Choose the largest threshold where empirical hallucination rate ≤ target on held‑out data, as sketched below.
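One way to implement that selection rule, assuming samples arrive as (risk, hallucinated) pairs on held‑out data and risk scores lie in [0, 1] as in the scoring sketch above; calibrate_threshold is an illustrative name, not an established API.

from typing import List, Tuple

def calibrate_threshold(samples: List[Tuple[float, bool]], target: float = 0.05) -> float:
    # samples: (risk, hallucinated) pairs computed on held-out data.
    if not samples:
        raise ValueError("calibration needs held-out samples")
    ordered = sorted(samples)               # ascending risk
    accepted, bad, best_prefix = 0, 0, 0
    for _, hallucinated in ordered:
        accepted += 1
        bad += int(hallucinated)
        if bad / accepted <= target:
            best_prefix = accepted          # largest prefix meeting the target so far
    if best_prefix == len(ordered):
        return 1.0                          # every held-out sample can be accepted
    # Under the "risk < threshold" policy this accepts exactly the qualifying prefix.
    return ordered[best_prefix][0]

Taking the largest qualifying prefix matches the "largest threshold" rule above: the running rate may briefly exceed the target and dip back under it, and this keeps the most permissive threshold that still meets it.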
Policy integration
Below the threshold: generate. Near the threshold: fetch more context (expand top‑k, add tools). Above the high‑risk buffer: abstain or reroute to a stronger model. Record decisions and outcomes for audits.
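A sketch of that policy, where the Action names and the high_risk_buffer default are illustrative choices, not part of any published interface.

from enum import Enum

class Action(Enum):
    GENERATE = "generate"              # answer directly
    FETCH_CONTEXT = "fetch_context"    # expand top-k retrieval or add tools, then re-score
    ABSTAIN = "abstain"                # refuse, or reroute to a stronger model

def decide(risk: float, threshold: float, high_risk_buffer: float = 0.15) -> Action:
    # Below the calibrated threshold: safe to generate.
    if risk < threshold:
        return Action.GENERATE
    # Near the threshold: more evidence may bring risk back under it.
    if risk < threshold + high_risk_buffer:
        return Action.FETCH_CONTEXT
    # Well above the threshold: abstain or reroute.
    return Action.ABSTAIN

Logging the (risk, threshold, action) triple for every request feeds the audit trail described next.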
Evidence & audits
Log thresholds, decisions, and outcomes, along with selective‑prediction coverage and SLO status, inside evidence bundles. Keep the record falsifiable and reproducible.
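One possible shape for a single evidence entry; the field names below are assumptions about what a useful bundle might contain, not a schema defined elsewhere.

import json
import time
from typing import Optional

def evidence_record(request_id: str, risk: float, threshold: float, target: float,
                    action: str, outcome: Optional[str] = None) -> str:
    # One JSON line per decision; outcome is filled in later once the answer
    # has been reviewed (e.g., "correct", "hallucination", "abstained").
    return json.dumps({
        "request_id": request_id,
        "timestamp": time.time(),
        "target_hallucination_rate": target,
        "risk": risk,
        "threshold": threshold,
        "action": action,
        "outcome": outcome,
    })

Replaying these records against a candidate threshold shows whether the target rate would still be met, which is what keeps the bundle falsifiable.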
Common pitfalls
- Over‑abstaining: re‑calibrate per segment (see the sketch after this list); ensure coverage targets remain useful.
- Weak labels: validate sample correctness labels before calibration.
- Domain shift: periodically refresh samples; track drift to maintain guarantees.
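A per‑segment recalibration sketch for the over‑abstaining pitfall, assuming the calibrate_threshold helper from the calibration sketch is in scope; how requests are segmented (domain, language, route) is up to the deployment.

from collections import defaultdict
from typing import Dict, List, Tuple

def per_segment_thresholds(samples: List[Tuple[str, float, bool]],
                           target: float = 0.05) -> Dict[str, float]:
    # samples: (segment, risk, hallucinated) triples. One threshold per segment
    # avoids a single global cutoff over-abstaining on harder segments.
    by_segment: Dict[str, List[Tuple[float, bool]]] = defaultdict(list)
    for segment, risk, hallucinated in samples:
        by_segment[segment].append((risk, hallucinated))
    # calibrate_threshold: see the calibration sketch above.
    return {segment: calibrate_threshold(rows, target)
            for segment, rows in by_segment.items()}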
Further reading
- Hassana Labs: Hallucination Risk Calculator & Prompt Re‑engineering Toolkit (OpenAI‑only) — hassana.io/readme.html
- OpenAI: “Why language models hallucinate” (whitepaper) — PDF
Attribution: Inspired by recent work on pre‑generation risk estimation and hallucination controls, including the resources above.
See also: Stability Demo · Whitepaper · FAQ