Innovation Test Harness: Fast, Deterministic Model Evaluation
Deterministic, repeatable evaluation with ablations, stability probes, privacy checks, and exportable evidence. Built for rapid iteration and auditability.
Why a Harness?
Engineering velocity dies when each evaluation is bespoke. The harness provides determinism (fixed seeds, frozen recipes) and coverage (utility, stability, privacy) in a single, repeatable run. It produces signed evidence suitable for internal review and external buyers without exposing proprietary code.
Architecture
- Deterministic runs: fixed seeds, pinned versions, dataset snapshots (Delta commit/version).
- Reproducible configs: JSON/YAML recipes; recipe hash recorded (see the sketch after this list).
- Ablations & probes: module toggles, feature families, DP budgets; stability under noise/fault injections.
- Evidence export: signed manifest with metrics and deltas; checksums + SBOM for artifacts.
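A minimal sketch of the determinism pieces in Python; names like seed_everything and recipe.json are illustrative, not the harness API:

import hashlib
import json
import os
import random

import numpy as np

def recipe_hash(recipe: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same recipe
    # yields the same digest regardless of key order or whitespace.
    canonical = json.dumps(recipe, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def seed_everything(seed: int) -> None:
    # Pin every RNG the run touches; extend for torch/tf if those are used.
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

with open("recipe.json") as f:
    recipe = json.load(f)
seed_everything(recipe.get("seed", 0))
print("recipe_hash:", recipe_hash(recipe))

Canonical serialisation is what makes the recipe hash stable: two runs with the same digest were configured byte-identically.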
Capabilities
- Model registry: Hugging Face + proprietary adapters with semantic versioning.
- Task adapters: classification, ranking, regression, and generative scoring.
- Ablations: feature drops, module on/off, DP ε schedules, quantisation levels.
- Stability: jitter/dropout, domain shifts, corruption (vision/text), drift sweeps (see the corruption-sweep sketch after this list).
- Privacy: membership inference, attribute disclosure; optional DP accounting.
- Exports: CSV/Parquet summaries, JSON evidence bundle, signed manifest.json.
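As an illustration of the stability probes, a minimal brightness-corruption sweep; model.predict and metric(labels, preds) are assumed interfaces here, not the harness's actual task adapters:

import numpy as np

def brightness_shift(images: np.ndarray, pct: float) -> np.ndarray:
    # Scale pixel intensities by +/- pct and clip back into [0, 1].
    return np.clip(images * (1.0 + pct), 0.0, 1.0)

def stability_sweep(model, images, labels, metric, shifts=(-0.15, 0.15)):
    # Score the same model under each corruption level; the delta against
    # the unperturbed baseline is the stability signal reported per probe.
    baseline = metric(labels, model.predict(images))
    return {
        pct: metric(labels, model.predict(brightness_shift(images, pct))) - baseline
        for pct in shifts
    }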
Workflow
- Select dataset/task (snapshot) and one or more models.
- Choose ablations and probes (e.g., ε ∈ {0.5, 0.8, 1.0}; brightness ±15%).
- Run the suite; compare lift vs baseline (see the sketch after this list) and probe outcomes.
- Export the evidence bundle; attach to review, contract, or listing.
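For the comparison step, lift vs baseline is reported with uncertainty rather than as a bare point delta. A minimal paired-bootstrap sketch, assuming per-example scores for both models have already been computed:

import numpy as np

def bootstrap_lift(candidate: np.ndarray, baseline: np.ndarray,
                   n_boot: int = 2000, seed: int = 0):
    # Paired bootstrap over per-example score deltas: resample examples
    # with replacement and collect the mean lift in each resample.
    rng = np.random.default_rng(seed)  # seeded, so the interval is reproducible
    deltas = candidate - baseline
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    lifts = deltas[idx].mean(axis=1)
    return deltas.mean(), np.percentile(lifts, [2.5, 97.5])

Seeding the resampler keeps the confidence interval itself deterministic across identical runs, consistent with the rest of the harness.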
Example Config
{
  "task": "classification",
  "dataset": { "name": "aml_graph_v2", "deltaVersion": 143 },
  "models": [
    { "id": "hf:distilroberta-base", "rev": "1.1.0" },
    { "id": "internal:graph-ranker-v3", "rev": "3.2.4" }
  ],
  "ablations": { "features": ["device_id"], "dp_epsilons": [0.5, 0.8, 1.0] },
  "stability": { "packet_drop": 0.1, "time_shift": "7d" },
  "metrics": ["auc", "f1", "ks"],
  "privacy": { "mia": true, "attribute": ["age_band", "segment"] }
}
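Expanding a recipe like the one above into concrete runs is a cross product over the ablation axes; a minimal sketch (the harness's real expansion rules may differ):

import itertools
import json

with open("recipe.json") as f:
    recipe = json.load(f)

abl = recipe["ablations"]
# One run per (model, dropped feature, epsilon); the baseline keeps all
# features and applies no DP noise (both axes set to None).
runs = [
    {"model": m["id"], "drop_feature": feat, "epsilon": eps}
    for m in recipe["models"]
    for feat, eps in itertools.product([None] + abl["features"],
                                       [None] + abl["dp_epsilons"])
]
print(f"{len(runs)} runs")  # 2 models x 2 feature settings x 4 epsilons = 16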
Evidence Outputs
- Utility: AUC/F1/MAE/KS deltas vs baseline; confidence intervals.
- Stability: drift sensitivity and performance under corruptions.
- Privacy: MIA/attribute disclosure summaries; DP ε/δ when used.
- Lineage: dataset versions, recipe hash, artifact checksums, container digest (see the sketch after this list).
- Ablation table: per‑toggle deltas for review.
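A minimal sketch of the lineage side of the bundle; the manifest fields are illustrative, and the signature is stubbed as a plain SHA-256 digest where the real harness would apply a cryptographic signing key:

import hashlib
import json
import pathlib

def sha256_file(path: pathlib.Path) -> str:
    # Stream the file so large artifacts never load fully into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(artifact_dir: str, recipe_digest: str) -> dict:
    artifacts = {
        p.name: sha256_file(p)
        for p in sorted(pathlib.Path(artifact_dir).glob("*"))
        if p.is_file()
    }
    manifest = {"recipe_hash": recipe_digest, "artifacts": artifacts}
    # Stand-in for a real signature: a digest over the canonical manifest body.
    body = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["signature"] = hashlib.sha256(body).hexdigest()
    return manifest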
Reading Results
- Highlight the changes that matter: report effect sizes with confidence intervals, not bare point deltas.
- Fail‑closed: a run is marked failing when privacy bounds exceed thresholds or utility regresses beyond tolerance, as sketched below.
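A minimal fail-closed gate, assuming threshold fields like mia_auc_max come from the recipe (the field names are illustrative):

def gate(results: dict, thresholds: dict) -> bool:
    # Fail closed: a missing metric or an exceeded bound fails the run.
    try:
        privacy_ok = results["mia_auc"] <= thresholds["mia_auc_max"]
        utility_ok = results["auc_delta"] >= -thresholds["utility_tolerance"]
    except KeyError:
        return False  # absent evidence counts as failure
    return privacy_ok and utility_ok

assert gate({"mia_auc": 0.52, "auc_delta": 0.01},
            {"mia_auc_max": 0.55, "utility_tolerance": 0.02})

Treating absent evidence as failure is the point of fail-closed: a run cannot pass simply because a probe never produced a number.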
Try it
Use the Stability Demo and Evidence Export to generate a small, local bundle with simulated data. Deterministic mode ensures identical outputs for identical inputs and configs.