Always‑on Evaluators: Cheap, Continuous Risk Scoring for Reliable AI
Summary: Compact small‑language‑model evaluators run per‑turn to score metrics like toxicity, PII, prompt injection, bias, and jailbreaking. Scores feed a pre‑generation Risk Guard that decides whether to generate, fetch more context, reroute, or abstain (fail‑closed). Result: fewer tokens and calls, lower latency and energy, and auditable behavior.
Why evaluators?
Generative systems fail in the gaps: thin context, adversarial prompts, or unclear objectives. Instead of hoping, we measure. A set of compact evaluators scores each turn and enforces policy before we generate; the evaluators are inexpensive enough to run continuously on CPU/NPU.
What we score
- Prompt injection, PII leakage, tool‑error patterns
- Toxicity, bias, jailbreak attempts
- Optional goal‑completion heuristics for tool‑using agents
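For illustration, the metrics above can be declared as a small registry that the runner iterates over each turn. The metric names mirror the list, but the thresholds, weights, and the `sampled` flag are placeholder assumptions for the sketch, not our production configuration.

```python
# Illustrative per-metric registry for the always-on evaluators.
# Metric names follow the list above; thresholds, weights, and the
# `sampled` flag are placeholder values, not production settings.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricConfig:
    name: str          # metric identifier
    threshold: float   # scores above this contribute to the risk gate
    weight: float      # relative weight when aggregating into a risk score
    sampled: bool      # heavier metrics can be sampled instead of run every turn

EVALUATOR_METRICS = [
    MetricConfig("prompt_injection", threshold=0.50, weight=1.0, sampled=False),
    MetricConfig("pii_leakage",      threshold=0.40, weight=1.0, sampled=False),
    MetricConfig("tool_error",       threshold=0.60, weight=0.5, sampled=False),
    MetricConfig("toxicity",         threshold=0.50, weight=1.0, sampled=False),
    MetricConfig("bias",             threshold=0.55, weight=0.7, sampled=True),
    MetricConfig("jailbreak",        threshold=0.45, weight=1.0, sampled=False),
    MetricConfig("goal_completion",  threshold=0.30, weight=0.3, sampled=True),
]
```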
How it integrates
- Signals → Risk Guard: evaluator scores are inputs to our pre‑generation gate alongside retrieval support, margin, and entropy.
- Actions: below threshold → generate; above → fetch more context; well above → abstain or reroute (a gate sketch follows this list).
- Evidence: we log evaluation_events.json and an aggregated evaluation_summary.json into the signed evidence bundle (sketched below).
- Runtime: CPU/NPU‑first via our runner; GPU optional. Cheap metrics parallelised; heavier ones sampled.
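To make the Actions item concrete, here is a minimal sketch of the gate: all metrics below threshold → generate; marginal breaches → fetch more context; clear breaches → abstain or reroute, with any missing score treated as a breach so the gate fails closed. The function name, the soft margin, and the action labels are illustrative, not the production Risk Guard, and the retrieval-support, margin, and entropy inputs mentioned in the Signals item are omitted for brevity.

```python
# Minimal pre-generation gate sketch. Names, the soft margin, and the action
# labels are illustrative; retrieval support, margin, and entropy inputs are
# omitted for brevity.
from typing import Dict, Optional

GENERATE = "generate"
FETCH_CONTEXT = "fetch_context"
ABSTAIN_OR_REROUTE = "abstain_or_reroute"

def risk_guard(scores: Dict[str, float],
               thresholds: Dict[str, float],
               soft_margin: float = 0.10) -> str:
    """Map per-metric risk scores to an action. Fail-closed: a metric that is
    expected but missing from `scores` counts as a breach."""
    breaches: Dict[str, Optional[float]] = {}
    for name, limit in thresholds.items():
        score = scores.get(name)
        if score is None or score > limit:
            breaches[name] = score

    if not breaches:
        return GENERATE                      # all metrics below threshold
    if all(score is not None and score <= thresholds[name] + soft_margin
           for name, score in breaches.items()):
        return FETCH_CONTEXT                 # marginal breach: retry with more context
    return ABSTAIN_OR_REROUTE                # clear breach or missing score: fail closed
```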
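The Evidence item maps naturally onto a small per-turn logger. Only the two filenames come from this post; the record fields, the class name, and the aggregation choices are assumptions for the sketch, and signing of the bundle is handled elsewhere.

```python
# Sketch of per-turn evidence logging. Only the two filenames come from the
# text above; record fields and aggregation are illustrative, and signing of
# the bundle happens elsewhere.
import json
import statistics
import time
from pathlib import Path

class EvidenceLogger:
    """Collects per-turn evaluator results and writes the two evidence files
    into the (to-be-signed) bundle directory."""

    def __init__(self, bundle_dir: Path):
        self.bundle_dir = bundle_dir
        self.events = []

    def record(self, turn_id: str, scores: dict, action: str) -> None:
        # One record per turn: scores from the evaluators plus the gate's action.
        self.events.append({"turn_id": turn_id, "ts": time.time(),
                            "scores": scores, "action": action})

    def flush(self) -> None:
        # Per-turn records.
        (self.bundle_dir / "evaluation_events.json").write_text(
            json.dumps(self.events, indent=2), encoding="utf-8")
        # Aggregate view: action counts and mean score per metric.
        actions, per_metric = {}, {}
        for event in self.events:
            actions[event["action"]] = actions.get(event["action"], 0) + 1
            for name, score in event["scores"].items():
                per_metric.setdefault(name, []).append(score)
        summary = {
            "turns": len(self.events),
            "actions": actions,
            "mean_scores": {m: statistics.fmean(v) for m, v in per_metric.items()},
        }
        (self.bundle_dir / "evaluation_summary.json").write_text(
            json.dumps(summary, indent=2), encoding="utf-8")
```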
Controls
- Per‑metric thresholds with a fail‑closed policy
- Segment‑aware calibration (e.g., by product, region)
- Live toggle in the Stability Demo: “Always‑on Evaluators (20 metrics)”
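One way the first two controls might be expressed is a segment-aware threshold table with a fail-closed fallback; the segment keys and numbers below are placeholders, and real thresholds would be calibrated per segment from labelled traffic rather than hand-set.

```python
# Illustrative segment-aware thresholds with a fail-closed fallback.
# Segment keys and numbers are placeholders; production values are
# calibrated per segment, not hand-set.
DEFAULT_THRESHOLDS = {"toxicity": 0.50, "pii_leakage": 0.40, "prompt_injection": 0.50}

SEGMENT_THRESHOLDS = {
    # (product, region) -> overrides merged on top of the defaults
    ("support_chat", "eu"): {"pii_leakage": 0.30},   # stricter PII policy in the EU
    ("search_assist", "us"): {"toxicity": 0.60},
}

def thresholds_for(product: str, region: str) -> dict:
    """Unknown segments fall back to the defaults, so no segment ever runs ungated."""
    merged = dict(DEFAULT_THRESHOLDS)
    merged.update(SEGMENT_THRESHOLDS.get((product, region), {}))
    return merged
```

Combined with the gate sketched earlier, a request would call `risk_guard(scores, thresholds_for(product, region))` before any large-model generation.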
Impact
- Token savings via early abstention and rerouting
- Latency reduction by avoiding unnecessary large‑model calls
- Energy and cost reduction; on‑device friendly
- Auditability through signed evidence
Read next: Evidence‑Efficient AI (73%) · A Billion Queries · Dashboard