On‑Device AI: Hybrid Routing & SLOs

TL;DR: Hybrid routing runs gates and re‑rankers on the device's CPU or NPU and is governed by three measurable SLOs: fallback rate, battery budget, and thermal guard. Evidence bundles include sampled telemetry summaries; raw data stays on device.

What we shipped

Optional on‑device SLOs in the SLO service: ondevice.enabled, max_fallback_rate, max_battery_mwh, max_temp_delta_c.
Demo wiring in Stability Demo to simulate on‑device metrics and visualize status.
Docs: On‑Device AI Playbook with routing, packaging, telemetry, and privacy boundaries.

Why on‑device now

Two forces drive on‑device AI adoption: (1) cost and latency benefits from running narrow tasks locally, and (2) stronger privacy and availability guarantees for regulated or intermittent environments. A hybrid policy lets you keep the best of both: CPU/NPU for cheap/fast gates and re‑rankers, and cloud for heavy or unsupported workloads.

Quick start

ondevice: { enabled: true, max_fallback_rate: 0.15, max_battery_mwh: 2.5, max_temp_delta_c: 6 }

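During evaluation, supply the observed metrics fallback_rate, energy_mwh, and temp_delta_c; they are checked against max_fallback_rate, max_battery_mwh, and max_temp_delta_c respectively. Thresholds can be tuned per device class.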
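As a rough illustration of how the thresholds and metrics line up, the sketch below compares each observed metric with its limit. The class OnDeviceSLO and the function evaluate_slo are hypothetical names invented for this example; they are not the SLO service's API, and only the config keys and metric names come from the text above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class OnDeviceSLO:
    # Field names mirror the config keys from the quick start.
    enabled: bool = True
    max_fallback_rate: float = 0.15
    max_battery_mwh: float = 2.5
    max_temp_delta_c: float = 6.0


def evaluate_slo(slo: OnDeviceSLO, metrics: Dict[str, float]) -> Dict[str, bool]:
    """Return True per metric when it is within its limit (hypothetical helper)."""
    if not slo.enabled:
        return {}  # nothing to check when on-device SLOs are switched off
    return {
        "fallback_rate": metrics["fallback_rate"] <= slo.max_fallback_rate,
        "energy_mwh": metrics["energy_mwh"] <= slo.max_battery_mwh,
        "temp_delta_c": metrics["temp_delta_c"] <= slo.max_temp_delta_c,
    }


if __name__ == "__main__":
    observed = {"fallback_rate": 0.12, "energy_mwh": 3.1, "temp_delta_c": 4.0}
    # Energy exceeds the 2.5 mWh budget; the other two metrics are within limits.
    print(evaluate_slo(OnDeviceSLO(), observed))
```

A per-device-class setup could simply hold one OnDeviceSLO instance per class, for example a looser thermal limit for a gateway than for a handset.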

Routing policy

Modes: device‑only, hybrid (default), cloud‑only. Prefer the NPU where supported; the local backend chain is NPU → GPU → CPU. When an SLO is exceeded, the router either promotes requests to the cloud or reduces on‑device coverage.
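One way to picture the policy is a small target-selection function. This is a minimal sketch under assumptions: Mode, LOCAL_CHAIN, and choose_target are illustrative names, and the handling of device-only mode under an SLO breach (stay local and leave coverage reduction to the caller) is a guess rather than documented behavior.

```python
from enum import Enum
from typing import Set


class Mode(Enum):
    DEVICE_ONLY = "device-only"
    HYBRID = "hybrid"
    CLOUD_ONLY = "cloud-only"


# Local preference order taken from the routing policy text.
LOCAL_CHAIN = ("npu", "gpu", "cpu")


def choose_target(mode: Mode, supported: Set[str], slo_breached: bool = False) -> str:
    """Pick where to run a request (hypothetical helper, not a shipped API)."""
    if mode is Mode.CLOUD_ONLY:
        return "cloud"
    if mode is Mode.HYBRID and slo_breached:
        return "cloud"  # promote when an SLO is exceeded
    for backend in LOCAL_CHAIN:
        if backend in supported:
            return backend
    if mode is Mode.HYBRID:
        return "cloud"  # no usable local backend, so offload
    # Assumption: device-only mode cannot offload, so this is an error.
    raise RuntimeError("no supported local backend in device-only mode")


if __name__ == "__main__":
    print(choose_target(Mode.HYBRID, {"gpu", "cpu"}))                     # gpu
    print(choose_target(Mode.HYBRID, {"npu", "cpu"}, slo_breached=True))  # cloud
```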

SLOs in plain language (with examples)

Fallback rate: cap the fraction of requests sent to the cloud. Example: max_fallback_rate=0.15 means at most 15% of requests may offload.
Battery budget: cap energy per inference. Example: max_battery_mwh=2.5 caps per‑request energy at 2.5 mWh; the system reduces coverage or defers work when the cap is exceeded.
Thermal guard: cap temperature rise. Example: max_temp_delta_c=6 throttles inference once the device temperature has risen by more than 6 °C.

Calibration and tuning

Collect a small set of local traces (latency, energy, thermal, fallback reasons).
Set initial SLOs per device class (e.g., handset vs gateway).
Run a short shadow window; measure observed fallback and energy.
Iteratively tighten or relax the SLOs until they hold with some headroom across device segments.
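The steps above could be bootstrapped with a small script that turns local traces into starting thresholds. This is only a sketch: the trace fields fell_back, energy_mwh, and temp_delta_c, the nearest-rank P95, and the 20% buffer are assumptions chosen for illustration, not values prescribed by the playbook.

```python
import math
from typing import Dict, List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def propose_slos(traces: List[Dict], buffer: float = 0.2) -> Dict[str, float]:
    """Suggest initial thresholds: observed value plus a relative buffer."""
    if not traces:
        raise ValueError("need at least one trace")
    fallback_rate = sum(1 for t in traces if t["fell_back"]) / len(traces)
    energy_p95 = p95([t["energy_mwh"] for t in traces])
    temp_p95 = p95([t["temp_delta_c"] for t in traces])
    return {
        "max_fallback_rate": round(min(1.0, fallback_rate * (1 + buffer)), 3),
        "max_battery_mwh": round(energy_p95 * (1 + buffer), 3),
        "max_temp_delta_c": round(temp_p95 * (1 + buffer), 3),
    }


if __name__ == "__main__":
    # Ten synthetic traces for one device class; one of them fell back.
    traces = [
        {"fell_back": i == 0, "energy_mwh": 2.0, "temp_delta_c": 5.0}
        for i in range(10)
    ]
    print(propose_slos(traces))
```

Running this once per device class gives a first set of thresholds to carry into the shadow window; the buffer can then be widened or narrowed based on what the shadow run observes.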

Telemetry boundaries

We log local, sampled summaries (e.g., P95 latency, energy estimate, fallback reason); raw content stays on the device. Optional DP summaries can sync to the platform for fleet‑level analysis.
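To make the boundary concrete, here is a hedged sketch of a local summarizer that reads only numeric fields and reason codes from each event. The event keys (latency_ms, energy_mwh, fallback_reason, raw_text), the 10% default sample rate, and the function name summarize are all assumptions made for this example.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Optional


def summarize(events: List[Dict], sample_rate: float = 0.1,
              rng: Optional[random.Random] = None) -> Optional[Dict]:
    """Build a sampled summary; raw content in the events is never read."""
    rng = rng or random.Random()
    sampled = [e for e in events if rng.random() < sample_rate]
    if not sampled:
        return None
    latencies = sorted(e["latency_ms"] for e in sampled)
    p95_latency = latencies[math.ceil(0.95 * len(latencies)) - 1]  # nearest rank
    reasons = Counter(
        e["fallback_reason"] for e in sampled if e.get("fallback_reason")
    )
    return {
        "n": len(sampled),
        "p95_latency_ms": p95_latency,
        "mean_energy_mwh": sum(e["energy_mwh"] for e in sampled) / len(sampled),
        "fallback_reasons": dict(reasons),
    }


if __name__ == "__main__":
    events = [
        {"latency_ms": 10, "energy_mwh": 1.0, "raw_text": "stays on device"},
        {"latency_ms": 20, "energy_mwh": 2.0, "raw_text": "stays on device"},
        {"latency_ms": 30, "energy_mwh": 3.0, "fallback_reason": "thermal"},
        {"latency_ms": 40, "energy_mwh": 2.0, "raw_text": "stays on device"},
    ]
    # sample_rate=1.0 keeps every event so the demo output is deterministic.
    print(summarize(events, sample_rate=1.0))
```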

Evidence & privacy

Evidence bundles include device‑class tags and telemetry summaries. Raw data and residuals remain on device; optional DP summaries may sync.
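Assuming "DP" here refers to differential privacy, a synced summary might add calibrated noise before it leaves the device. The sketch below applies the standard Laplace mechanism to a fallback count; the function names and the choice of a count as the synced statistic are illustrative assumptions, not a description of the platform's mechanism.

```python
import random
from typing import Iterable, Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_fallback_count(fallback_flags: Iterable[bool], epsilon: float,
                      rng: Optional[random.Random] = None) -> float:
    """Return a noisy count of fallbacks (hypothetical helper).

    Adding or removing one request changes the count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for flag in fallback_flags if flag)
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    flags = [True, False, True, False, False]
    # Smaller epsilon means more noise and stronger privacy.
    print(dp_fallback_count(flags, epsilon=0.5))
```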

Integration checklist

Set flags: ONDEVICE_ENABLED, ONDEVICE_HYBRID.
Provide metrics: fallback_rate, energy_mwh, temp_delta_c.
Define action policy when SLOs breach (reduce coverage, defer, or promote to cloud).
Add device‑class tags to evidence bundles.
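The action policy in the checklist can be expressed as a small decision function. The ordering below (promote when the device is stressed and cloud headroom remains, otherwise defer if the work can wait, otherwise reduce coverage) is one plausible policy offered as an assumption; Action and choose_action are hypothetical names.

```python
from enum import Enum
from typing import Set


class Action(Enum):
    NONE = "none"
    REDUCE_COVERAGE = "reduce_coverage"
    DEFER = "defer"
    PROMOTE_TO_CLOUD = "promote_to_cloud"


def choose_action(breached: Set[str], cloud_allowed: bool = True,
                  deferrable: bool = False) -> Action:
    """Map the set of breached metrics to an action (illustrative policy only)."""
    if not breached:
        return Action.NONE
    device_stressed = bool(breached & {"energy_mwh", "temp_delta_c"})
    # Promoting more traffic would worsen an already breached fallback rate,
    # so only promote while that SLO still holds.
    if device_stressed and cloud_allowed and "fallback_rate" not in breached:
        return Action.PROMOTE_TO_CLOUD
    if deferrable:
        return Action.DEFER
    return Action.REDUCE_COVERAGE


if __name__ == "__main__":
    print(choose_action({"temp_delta_c"}))                   # Action.PROMOTE_TO_CLOUD
    print(choose_action({"temp_delta_c", "fallback_rate"}))  # Action.REDUCE_COVERAGE
```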

Troubleshooting

Frequent promotions: tighten selective thresholds or raise max_fallback_rate modestly while monitoring accuracy.
Thermal throttling: reduce batch size, prefer INT8 quantization, or raise max_temp_delta_c while lowering the throughput target.
Battery drain: lower coverage or schedule bursts during charge windows.