TL;DR: We reduced tokens and latency by 73% and avoided all large‑model calls on a realistic task using open NYC Taxi anchors. In plain English: we now fetch only what we need, think with smaller pieces, and choose when to answer or ask for more context. This is faster, cheaper, and easier to trust.
What this means for you
Faster answers on normal hardware: a small on‑device model handles most work. You feel the speed.
Lower bills: big models run rarely. Fewer tokens moved and processed.
Fewer made‑up answers: when confidence is low, the system fetches more or abstains instead of guessing.
What this means for Aethergen
We can operate with service targets for reliability and speed, and we can prove it. The platform composes small specialist pieces with clear guardrails and exports evidence so teams can review decisions without exposing raw data.
At a glance:
Tokens reduced: 73%
Latency improvement: 73%
Large‑model calls avoided: 100%
Storage saved (typical): 80–98%
How we did it (simple version)
Retrieve then read: we do not try to remember the whole web. We fetch only a few relevant facts, then pack them efficiently.
Small‑model first: a small model drafts answers. Only hard cases escalate to heavier tools.
Risk check before answering: if confidence is low, we fetch more context or abstain rather than guess. That stops low‑quality output; a minimal sketch of this loop follows the list.
Compact memory: we keep tiny anchors and compressed vectors instead of storing raw pages.
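To make the flow concrete, here is a minimal sketch of the retrieve, draft, and risk-check loop in Python. It is illustrative only: the function names (retrieve_anchors, small_model_draft), the confidence threshold, and the fetch cap are assumptions, not Aethergen's actual API or settings.

```python
# Minimal sketch of the retrieve -> draft -> risk-check loop described above.
# All names and thresholds are illustrative placeholders, not the real system.

from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.7   # below this, fetch more context or abstain (assumed value)
MAX_FETCH_ROUNDS = 2     # cap on extra retrieval rounds (assumed value)


@dataclass
class Draft:
    text: str
    confidence: float     # verifier- or self-scored confidence in [0, 1]


def retrieve_anchors(query: str, k: int) -> list[str]:
    """Placeholder: return the k most relevant compact anchors for the query."""
    return [f"anchor {i} for: {query}" for i in range(k)]


def small_model_draft(query: str, context: list[str]) -> Draft:
    """Placeholder: a small on-device model drafts an answer from packed context."""
    return Draft(text=f"draft answer using {len(context)} anchors", confidence=0.8)


def answer(query: str) -> str:
    context = retrieve_anchors(query, k=3)          # fetch only a few facts
    for _ in range(MAX_FETCH_ROUNDS + 1):
        draft = small_model_draft(query, context)   # small model goes first
        if draft.confidence >= CONFIDENCE_FLOOR:
            return draft.text                       # confident: answer now
        context += retrieve_anchors(query, k=3)     # low confidence: widen context
    return "Not enough evidence to answer."         # still unsure: abstain


if __name__ == "__main__":
    print(answer("What was the average trip distance in June?"))
```

The key design choice is that escalation is explicit: the loop either gains confidence by widening the context or stops and says so, instead of guessing.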
What we measured
Using the NYC Taxi Open Anchor Pack, we ran 40 queries. Factual questions used a tiny context and were answered immediately. Broader summary prompts used a compact context, and the system fetched more only when needed. We logged tokens, latency, and routing actions, and exported an evidence summary.
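As a rough illustration of the bookkeeping involved, the sketch below logs per-query tokens, latency, and routing actions, then writes a JSON evidence summary. The field names and placeholder numbers are assumptions, not the actual evidence schema.

```python
# Illustrative only: one way to record per-query metrics and export a summary.

import json
import time
from dataclasses import dataclass, asdict


@dataclass
class QueryRecord:
    query_id: int
    tokens_in: int
    tokens_out: int
    latency_ms: float
    routing_action: str   # e.g. "answer", "fetch_more", "abstain"


def run_query(query_id: int) -> QueryRecord:
    start = time.perf_counter()
    # ... retrieval and small-model drafting would happen here ...
    latency_ms = (time.perf_counter() - start) * 1000
    return QueryRecord(query_id, tokens_in=120, tokens_out=40,
                       latency_ms=latency_ms, routing_action="answer")


records = [run_query(i) for i in range(40)]   # 40 queries, as in the run above
summary = {
    "queries": len(records),
    "total_tokens": sum(r.tokens_in + r.tokens_out for r in records),
    "mean_latency_ms": sum(r.latency_ms for r in records) / len(records),
    "actions": {a: sum(r.routing_action == a for r in records)
                for a in ("answer", "fetch_more", "abstain")},
}

with open("evidence_summary.json", "w") as f:
    json.dump({"records": [asdict(r) for r in records], "summary": summary}, f, indent=2)
```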
[Charts: Baseline vs Composed — Tokens; Baseline vs Composed — Latency]
How storage falls (and why it matters)
Anchors over raw pages: we keep compact facts and summaries (“anchors”) instead of whole documents.
PQ‑compressed vectors: embeddings are product‑quantized, cutting space by roughly an order of magnitude (see the sketch after this list).
Deltas and TTL: we store changes and expiry, not infinite history. Legal filters remove low‑value content.
Result: it is common to replace multi‑terabyte corpora with a working set that is 80–98% smaller while preserving the ability to answer the same questions, with stronger provenance.
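For readers who want to see where the PQ savings come from, here is a minimal product-quantization sketch using NumPy and scikit-learn. The vector dimension, number of subvectors, codebook size, and corpus size are assumptions chosen for illustration; the compression ratio improves as the corpus grows, because the codebooks are a fixed cost.

```python
# Minimal product-quantization sketch: split each vector into subvectors,
# learn a 256-entry codebook per subspace, and store 1-byte codes per subvector.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, m = 10_000, 768, 96                     # vectors, dimension, subvectors (assumed)
sub_d = d // m                                # 8 dims per subvector
vectors = rng.standard_normal((n, d)).astype(np.float32)

codes = np.empty((n, m), dtype=np.uint8)      # 1 byte per subvector per vector
codebooks = []
for j in range(m):
    sub = vectors[:, j * sub_d:(j + 1) * sub_d]
    km = KMeans(n_clusters=256, n_init=1, random_state=0).fit(sub)  # single init keeps the demo quick
    codebooks.append(km.cluster_centers_.astype(np.float32))
    codes[:, j] = km.labels_.astype(np.uint8)

raw_bytes = vectors.nbytes                    # 10k x 768 x 4 bytes ≈ 30.7 MB
pq_bytes = codes.nbytes + sum(c.nbytes for c in codebooks)
print(f"raw: {raw_bytes / 1e6:.1f} MB, PQ: {pq_bytes / 1e6:.1f} MB "
      f"({raw_bytes / pq_bytes:.0f}x smaller)")
```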
Are we ready for commercial success?
We are ready to run pilots and production trials with clear guardrails. The approach is reliable by design: it is faster and cheaper while preserving the option to abstain. It ships with evidence so procurement, risk, and engineering can verify how results were produced.
What this is not
Not a claim that we memorized the world. We deliberately avoid that.
Not a promise of perfection. It is a controlled system that prefers to fetch more or abstain when uncertain.
Not tied to a specific vendor. It works with different model sizes and backends.
What you can do next
Explore the Context Engineering page to see how we select and pack information.
Try the Model Starters gallery and pick the right starter for your task.