On‑Device AI: Hybrid Routing & SLOs
TL;DR Hybrid routing prefers CPU/NPU for gates and re‑rankers, with measurable SLOs: fallback rate, battery budget, and thermal guard. Evidence includes sampled telemetry summaries; raw data stays on device.
What we shipped
- Optional on‑device SLOs in the SLO service: ondevice.enabled, max_fallback_rate, max_battery_mwh, max_temp_delta_c.
- Demo wiring in Stability Demo to simulate on‑device metrics and visualize status.
- Docs: On‑Device AI Playbook with routing, packaging, telemetry, and privacy boundaries.
Why on‑device now
Two forces drive on‑device AI adoption: (1) cost and latency benefits from running narrow tasks locally, and (2) stronger privacy and availability guarantees for regulated or intermittent environments. A hybrid policy lets you keep the best of both: CPU/NPU for cheap/fast gates and re‑rankers, and cloud for heavy or unsupported workloads.
Quick start
ondevice: { enabled: true, max_fallback_rate: 0.15, max_battery_mwh: 2.5, max_temp_delta_c: 6 }
During evaluation, provide metrics: fallback_rate, energy_mwh, temp_delta_c. Thresholds can be tuned per device class.
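To make the evaluation step concrete, here is a minimal sketch of checking one window of metrics against the quick-start thresholds. The OnDeviceSLO class and evaluate_ondevice_slo function are hypothetical names for illustration, not the SLO service's API, and treating a value exactly at the threshold as passing is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnDeviceSLO:
    # Field names mirror the quick-start config keys; defaults copy its values.
    enabled: bool = True
    max_fallback_rate: float = 0.15
    max_battery_mwh: float = 2.5
    max_temp_delta_c: float = 6.0


def evaluate_ondevice_slo(slo, metrics):
    """Return (ok, breaches) for one evaluation window.

    `metrics` is a dict with fallback_rate, energy_mwh and temp_delta_c.
    A disabled SLO always passes. (Hypothetical helper, not the service API.)
    """
    if not slo.enabled:
        return True, []
    limits = {
        "fallback_rate": slo.max_fallback_rate,
        "energy_mwh": slo.max_battery_mwh,
        "temp_delta_c": slo.max_temp_delta_c,
    }
    # Assumption: a metric breaches only when strictly above its threshold.
    breaches = [name for name, limit in limits.items() if metrics[name] > limit]
    return not breaches, breaches
```

Per-device-class tuning would then amount to constructing one OnDeviceSLO per class with different limits.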
Routing policy
Modes: device‑only, hybrid (default), cloud‑only. Prefer the NPU where supported; the local fallback chain is NPU → GPU → CPU. When an SLO is exceeded, the router promotes requests to the cloud or reduces coverage.
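The policy above can be sketched as a small target-selection function. This is an illustration under stated assumptions: the Mode enum, choose_target, and the backend labels are hypothetical names, and returning None in device-only mode on an SLO breach (so the caller reduces coverage or defers) is one possible reading of the policy rather than documented behavior.

```python
from enum import Enum


class Mode(Enum):
    DEVICE_ONLY = "device-only"
    HYBRID = "hybrid"
    CLOUD_ONLY = "cloud-only"


# Preference order from the text: NPU first, then GPU, then CPU.
FALLBACK_CHAIN = ("npu", "gpu", "cpu")


def choose_target(mode, available, slo_ok=True):
    """Pick an execution target, or None if nothing is allowed to run.

    `available` is the set of local backends that support the model.
    """
    if mode is Mode.CLOUD_ONLY:
        return "cloud"
    # First backend in the chain that the device actually offers.
    local = next((b for b in FALLBACK_CHAIN if b in available), None)
    if mode is Mode.DEVICE_ONLY:
        # Assumption: device-only never promotes; a breach yields None so the
        # caller can reduce coverage or defer instead.
        return local if slo_ok else None
    # Hybrid: stay local while SLOs hold, otherwise promote to cloud.
    if local is not None and slo_ok:
        return local
    return "cloud"
```

For example, choose_target(Mode.HYBRID, {"gpu", "cpu"}) selects "gpu", while the same call with slo_ok=False selects "cloud".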
SLOs in plain language (with examples)
- Fallback‑rate: cap the percent of requests sent to the cloud. Example: max_fallback_rate=0.15 means at most 15% of traffic may offload.
- Battery budget: cap energy per inference. Example: max_battery_mwh=2.5 caps per‑request energy; the system reduces coverage or defers when exceeded.
- Thermal guard: cap temperature rise. Example: max_temp_delta_c=6 throttles when the device exceeds safe deltas.
Calibration and tuning
- Collect a small set of local traces (latency, energy, thermal, fallback reasons).
- Set initial SLOs per device class (e.g., handset vs gateway).
- Run a short shadow window; measure observed fallback and energy.
- Iteratively tighten or relax SLOs until they hold with buffer across segments.
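One way to turn shadow-window traces into starting thresholds is to take the observed P95 (or observed fallback rate) and multiply by a headroom factor. The sketch below assumes a hypothetical trace shape (energy_mwh, temp_delta_c, fell_back) and a 1.2 headroom; neither comes from the SLO service, and the right buffer depends on the device class.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in (0, 100])."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def propose_slo(traces, headroom=1.2):
    """Propose thresholds for one device class from shadow-window traces.

    Each trace is a dict with energy_mwh, temp_delta_c and fell_back (bool).
    `headroom` is an assumed buffer factor, not a recommended value.
    """
    fallback_rate = sum(t["fell_back"] for t in traces) / len(traces)
    energy_p95 = percentile([t["energy_mwh"] for t in traces], 95)
    temp_p95 = percentile([t["temp_delta_c"] for t in traces], 95)
    return {
        "max_fallback_rate": round(min(1.0, fallback_rate * headroom), 3),
        "max_battery_mwh": round(energy_p95 * headroom, 3),
        "max_temp_delta_c": round(temp_p95 * headroom, 3),
    }
```

Running propose_slo separately on handset and gateway traces gives per-class starting points that can then be tightened or relaxed over further shadow windows.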
Telemetry boundaries
We log sampled summaries locally (e.g., P95 latency, energy estimate, fallback reason); raw content stays on the device. Optional differentially private (DP) summaries can sync to the platform for fleet‑level analysis.
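As an illustration of what a sampled, content-free summary might look like, the sketch below aggregates per-request measurements into a P95 latency, a mean energy estimate, and fallback-reason counts. The record fields, the 10% sample rate, and summarize_window itself are assumptions for this example; the DP step mentioned above is not shown.

```python
import math
import random
from collections import Counter


def summarize_window(records, sample_rate=0.1, rng=None):
    """Build a content-free summary from a sampled subset of local records.

    Each record is a dict with latency_ms, energy_mwh and fallback_reason
    (None when the request stayed on device). Request inputs and outputs
    are never part of a record, so they cannot leak into the summary.
    """
    rng = rng or random.Random()
    sampled = [r for r in records if rng.random() < sample_rate]
    if not sampled:
        return None
    latencies = sorted(r["latency_ms"] for r in sampled)
    # Nearest-rank P95 using integer arithmetic to avoid float rounding.
    p95 = latencies[max(1, math.ceil(95 * len(latencies) / 100)) - 1]
    reasons = Counter(r["fallback_reason"] for r in sampled if r["fallback_reason"])
    return {
        "sample_count": len(sampled),
        "p95_latency_ms": p95,
        "mean_energy_mwh": sum(r["energy_mwh"] for r in sampled) / len(sampled),
        "fallback_reasons": dict(reasons),
    }
```

Only the returned dict would be logged; the per-request records can be discarded once the window is summarized.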
Evidence & privacy
Evidence bundles include device‑class tags and telemetry summaries. Raw data and residuals remain on device; optional DP summaries may sync.
Integration checklist
- Set flags: ONDEVICE_ENABLED, ONDEVICE_HYBRID.
- Provide metrics: fallback_rate, energy_mwh, temp_delta_c.
- Define action policy when SLOs breach (reduce coverage, defer, or promote to cloud).
- Add device‑class tags to evidence bundles.
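The action-policy item in the checklist can be sketched as a lookup from breached metric to action. The text names the three actions but does not say which breach triggers which, so the mapping and the priority order below are assumptions chosen only for illustration, as are the names DEFAULT_ACTIONS, PRIORITY, and breach_action.

```python
# Hypothetical mapping from breached metric to action; adjust per deployment.
DEFAULT_ACTIONS = {
    "fallback_rate": "reduce_coverage",
    "energy_mwh": "defer",
    "temp_delta_c": "promote_to_cloud",
}

# Assumption: when several SLOs breach at once, the earliest action here wins.
PRIORITY = ("promote_to_cloud", "defer", "reduce_coverage")


def breach_action(breaches, actions=DEFAULT_ACTIONS):
    """Return the single action to take for a list of breached metric names."""
    chosen = {actions[name] for name in breaches}
    for action in PRIORITY:
        if action in chosen:
            return action
    return "none"
```

Note that with this ordering a simultaneous thermal and fallback-rate breach still promotes to cloud, which pushes the fallback rate higher; a deployment that values the fallback cap more would reorder PRIORITY.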
Troubleshooting
- Frequent promotions: tighten selective thresholds or raise max_fallback_rate modestly while monitoring accuracy.
- Thermal throttling: reduce batch size, prefer INT8, or raise the ΔT guard (max_temp_delta_c) and pair it with a lower throughput target.
- Battery drain: lower coverage or schedule bursts during charge windows.
See also: Resources · FAQ – On‑Device AI & SLOs · Stability Demo · On‑Device AI Playbook