Auspexi

Innovation Test Harness: Fast, Deterministic Model Evaluation

Deterministic, repeatable evaluation with ablations, stability probes, privacy checks, and exportable evidence. Built for rapid iteration and auditability.

Why a Harness?

Engineering velocity dies when each evaluation is bespoke. The harness provides determinism (fixed seeds, frozen recipes) and coverage (utility, stability, privacy) in a single, repeatable run. It produces signed evidence suitable for internal review and for sharing with external buyers, without exposing proprietary code.
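
To make "fixed seeds, frozen recipes" concrete, here is a minimal sketch of one way such determinism could be arranged. It is not the harness's actual implementation: the helper names (`recipe_fingerprint`, `seeded_rng`, `simulated_eval`) and the seed-derivation scheme are assumptions for illustration. The idea is to hash a canonical form of the recipe and derive the random seed from that hash, so the same recipe always yields the same run.

```python
import hashlib
import json
import random


def recipe_fingerprint(recipe: dict) -> str:
    """Hash a canonical JSON form of the recipe so any change is detectable."""
    canonical = json.dumps(recipe, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seeded_rng(recipe: dict, base_seed: int = 0) -> random.Random:
    """Derive an RNG from the fingerprint and a base seed (hypothetical scheme)."""
    seed = int(recipe_fingerprint(recipe)[:16], 16) ^ base_seed
    return random.Random(seed)


def simulated_eval(recipe: dict) -> list[float]:
    """Stand-in for an evaluation run: draws three pseudo-random scores."""
    rng = seeded_rng(recipe)
    return [round(rng.random(), 6) for _ in range(3)]


recipe = {"task": "classification", "metrics": ["auc", "f1", "ks"]}
# Same content with a different key order: canonical JSON makes it equivalent.
reordered = dict(reversed(list(recipe.items())))

run_a = simulated_eval(recipe)
run_b = simulated_eval(reordered)
print(run_a == run_b)  # True
```

Changing any value in the recipe changes the fingerprint, and with it the seed, so a silently edited recipe cannot be mistaken for the frozen one.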

Architecture

Capabilities

Workflow

  1. Select dataset/task (snapshot) and one or more models.
  2. Choose ablations and probes (e.g., ε ∈ {0.5, 0.8, 1.0}; brightness ±15%).
  3. Run the suite; compare lift against the baseline and review probe outcomes.
  4. Export the evidence bundle; attach to review, contract, or listing.
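
The comparison in step 3 can be sketched as follows. The text does not define "lift", so this sketch assumes relative change against the baseline metric. `RunResult` and `relative_lift` are hypothetical names, and the AUC values are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    label: str
    auc: float


def relative_lift(candidate: float, baseline: float) -> float:
    """Relative change of a metric against the baseline (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline metric must be non-zero")
    return (candidate - baseline) / baseline


# Placeholder numbers for illustration only; not results from any real run.
baseline = RunResult("baseline", auc=0.80)
ablated = [RunResult("eps=0.5", auc=0.76), RunResult("eps=1.0", auc=0.79)]

lifts = {r.label: round(relative_lift(r.auc, baseline.auc), 4) for r in ablated}
for label, lift in lifts.items():
    print(f"{label}: {lift:+.2%} vs baseline")
```

A negative lift here would mean the ablated or privacy-constrained run scored below the baseline, which is the trade-off the comparison is meant to surface.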

Example Config

{
  "task": "classification",
  "dataset": { "name": "aml_graph_v2", "deltaVersion": 143 },
  "models": [
    { "id": "hf:distilroberta-base", "rev": "1.1.0" },
    { "id": "internal:graph-ranker-v3", "rev": "3.2.4" }
  ],
  "ablations": { "features": ["device_id"], "dp_epsilons": [0.5, 0.8, 1.0] },
  "stability": { "packet_drop": 0.1, "time_shift": "7d" },
  "metrics": ["auc", "f1", "ks"],
  "privacy": { "mia": true, "attribute": ["age_band", "segment"] }
}
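
As a rough illustration of how a config like this might be turned into concrete runs, the sketch below parses the JSON and expands it into one run per model and ablation variant. The expansion rule is an assumption, not documented harness behavior: each entry in `ablations.features` is treated as a feature to drop, and each value in `dp_epsilons` as a separate differential-privacy run. `expand_runs` is a hypothetical helper.

```python
import json
from itertools import product

CONFIG_JSON = """
{
  "task": "classification",
  "dataset": { "name": "aml_graph_v2", "deltaVersion": 143 },
  "models": [
    { "id": "hf:distilroberta-base", "rev": "1.1.0" },
    { "id": "internal:graph-ranker-v3", "rev": "3.2.4" }
  ],
  "ablations": { "features": ["device_id"], "dp_epsilons": [0.5, 0.8, 1.0] },
  "stability": { "packet_drop": 0.1, "time_shift": "7d" },
  "metrics": ["auc", "f1", "ks"],
  "privacy": { "mia": true, "attribute": ["age_band", "segment"] }
}
"""


def expand_runs(config: dict) -> list[dict]:
    """Expand models x (baseline + feature ablations + DP epsilons) into runs."""
    ablations = config.get("ablations", {})
    variants = [{"kind": "baseline"}]
    variants += [{"kind": "drop_feature", "feature": f}
                 for f in ablations.get("features", [])]
    variants += [{"kind": "dp", "epsilon": e}
                 for e in ablations.get("dp_epsilons", [])]
    return [
        {"model": m["id"], "rev": m["rev"], **v}
        for m, v in product(config["models"], variants)
    ]


config = json.loads(CONFIG_JSON)
runs = expand_runs(config)
print(len(runs))  # 10: 2 models x (1 baseline + 1 feature + 3 epsilons)
```

Pinning `rev` and `deltaVersion` in each run record is what would let a later reader reproduce exactly the same model and dataset snapshot.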

Evidence Outputs

Reading Results

Try it

Use the Stability Demo and Evidence Export to generate a small, local bundle with simulated data. Deterministic mode ensures identical outputs for identical prompts.
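
For readers who want a feel for what a small, deterministic bundle could look like, here is a sketch under stated assumptions. The actual bundle format and signing scheme are not described here, so this uses a SHA-256 manifest signed with HMAC from the Python standard library. The artifact names, the demo key, and the helper functions are all hypothetical, and a production system might well use asymmetric signatures instead.

```python
import hashlib
import hmac
import json


def build_manifest(artifacts: dict[str, bytes]) -> dict[str, str]:
    """Map each artifact name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(artifacts.items())}


def sign_manifest(manifest: dict[str, str], key: bytes) -> str:
    """Sign the canonical JSON form of the manifest with HMAC-SHA256."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict[str, str], signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


# Simulated artifacts; the names and contents are placeholders.
artifacts = {
    "metrics.json": b'{"status": "simulated"}',
    "config.json": b'{"task": "classification"}',
}
key = b"demo-key-not-for-production"

manifest = build_manifest(artifacts)
signature = sign_manifest(manifest, key)
print(verify_manifest(manifest, signature, key))  # True

# Any change to an artifact changes its digest, so verification fails.
tampered = build_manifest({**artifacts, "metrics.json": b'{"status": "edited"}'})
print(verify_manifest(tampered, signature, key))  # False
```

Because the manifest is built from content hashes and serialized canonically, two runs that produce identical artifacts also produce identical signatures, which is the property deterministic mode is meant to provide.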