Evidence Bundles & Testing: Trustworthy AI Without Exposing IP

By Gwylym Owen — 18–24 min read

Executive Summary

AethergenPlatform ships an evidence bundle with every model and dataset release: signed metrics, configs, seeds, and hashes that let buyers and auditors reproduce claims without revealing proprietary internals. This approach builds trust while protecting IP and streamlines adoption in regulated domains such as healthcare and finance.

What We Publish

Utility at operating points with confidence intervals.
Stability across segments; drift sensitivity.
Ablation tables: what can change outcomes.
Limits and intended-use statements.
SBOM, artifact hashes, and lineage.
Privacy probes and optional DP budgets.

What We Withhold

Raw training corpora and proprietary recipe internals.
Weights beyond declared formats (unless contracted).
Security-sensitive thresholds and anti-abuse heuristics.

Testing Matrix

Utility at fixed budgets (alerts/day, error tolerance).
Segment stability (product/region/lifecycle).
Robustness (noise/corruptions) where applicable.
Drift early-warning and rollback rehearsal.
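
The drift early-warning row does not say which index is used. As one common possibility, the sketch below computes a Population Stability Index (PSI) between a baseline sample and a recent sample of one input feature. PSI is a stand-in chosen for illustration, and the bin edges and the `1e-6` floor are assumptions, not values from the bundle.

```python
import math


def population_stability_index(baseline, recent, edges):
    """PSI between two samples, binned by the given ascending edges."""

    def bin_shares(values):
        counts = [0] * (len(edges) + 1)
        for value in values:
            # Bin index = number of edges the value strictly exceeds.
            counts[sum(value > edge for edge in edges)] += 1
        # Floor each share so empty bins do not produce log(0).
        return [max(count / len(values), 1e-6) for count in counts]

    base, new = bin_shares(baseline), bin_shares(recent)
    return sum((n - b) * math.log(n / b) for b, n in zip(base, new))
```

A value of zero means the two binned distributions match; larger values indicate a bigger shift. The alerting threshold that would trigger a rollback rehearsal is a policy choice to document alongside the metric.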

Signing & CI

Evidence is built by a GitHub Actions workflow (`.github/workflows/evidence.yml`) on PRs and tags; it runs a Node script (`scripts/generate-evidence.cjs`) that creates the signed ZIPs.
The ZIP includes `metrics/`, `plots/`, `configs/`, `seeds/`, `sbom.json`, `manifest.json` (per-file SHA-256 hashes), and `index.json`; each build also appends to `.aethergen/change-log.json` for traceability.
Artifacts are uploaded with `manifest-hash.txt` for audit, and PR comments link to the downloads.
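
To make the per-file hashing step concrete, here is a minimal Python sketch of how a manifest of SHA-256 hashes could be produced for a bundle directory. It is not the contents of `scripts/generate-evidence.cjs`; the function names are hypothetical, and treating `manifest-hash.txt` as the SHA-256 of `manifest.json` is an assumption.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir: Path) -> dict:
    """Map each file's bundle-relative path to a 'sha256:<hex>' string."""
    hashes = {}
    for path in sorted(bundle_dir.rglob("*")):
        # The manifest cannot contain its own hash, so skip it.
        if path.is_file() and path.name != "manifest.json":
            rel = path.relative_to(bundle_dir).as_posix()
            hashes[rel] = "sha256:" + sha256_file(path)
    return {"hashes": hashes}


def write_manifest(bundle_dir: Path) -> str:
    """Write manifest.json and return the digest of the manifest file.

    Assumption: this digest is what would be stored in manifest-hash.txt.
    """
    manifest_path = bundle_dir / "manifest.json"
    manifest = build_manifest(bundle_dir)
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return sha256_file(manifest_path)
```

Sorting paths and keys keeps the manifest byte-for-byte stable across runs, so the manifest digest only changes when an artifact changes.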

Case Study: Healthcare Detector Potential

For healthcare-type customers, we can provide an evidence bundle with operating-point utility (e.g., at 1% FPR: +18% cases found, CI ±2.1%), stability across NA, EU, and APAC (max delta 2.8%), and drift monitors triggered quarterly. Procurement can review the HTML dashboard offline, verify hashes, and establish a 6-month refresh cadence with a rollback SOP. This can reduce adoption time from months to weeks.

Worked Example: Payments Mule-Ring Detector

For payments-type customers, we can deliver a mule-ring detector bundle with operating-point charts (e.g., at 2,000 alerts/day: +11% true positives, stable on weekends), parameter logs for ring size and reuse, and stability by product (max delta 1.9%). Buyers can reproduce metrics within CI bands using the provided seeds, adopt with a quarterly refresh SLA, and integrate the SBOM into their supply-chain audit.

FAQ

How is this different from a slide deck?

It’s reproducible. If you rerun with the same seeds and configs, you get the same metrics within confidence intervals. A slide deck makes static claims; an evidence bundle ties each claim to verifiable artifacts.

What if regulators ask for raw data?

We can provide synthetic corpora with measured fidelity and utility; for restricted cases, data can stay within the customer enclave. We also offer privacy probes to validate compliance.

Can buyers request custom evidence?

Yes, within scope: for example, additional segment stability or robustness tests. The CI pipeline can regenerate the bundle with a new manifest ID.

Glossary

Evidence bundle: signed ZIP package of metrics/configs/hashes.
Operating point: the decision threshold or alert budget at which buyers should evaluate the model (e.g., 1% FPR or 2,000 alerts/day).
SBOM: software bill of materials.

Checklist

Operating points declared; CIs attached.
Stability and drift thresholds documented.
Limits and intended use spelled out.
SBOM and hashes validated.


Sample Evidence Bundle Index

index.json
├─ metrics/
│  ├─ utility@op.json
│  ├─ stability_by_segment.json
│  ├─ drift_early_warning.json
│  └─ robustness_corruptions.json
├─ plots/
│  ├─ roc_pr_curves.html
│  ├─ operating_point_tradeoffs.html
│  └─ segment_bars.html
├─ configs/
│  ├─ evaluation.yaml
│  └─ thresholds.yaml
├─ seeds/
│  └─ seeds.txt
├─ sbom.json
└─ manifest.json

Operating Point Examples

Healthcare: 1% FPR → +18% cases found vs baseline, CI ±2.1%.
Payments: 2,000 alerts/day cap → +11% true positives, stable weekends.
Edge vision: 0.92 threshold → rework minutes −9%/1k units.
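
The examples above fix an operating point either by a false-positive rate or by a daily alert cap. The sketch below shows one way to turn either budget into a score threshold, assuming the model emits a score per case and alerts when the score is strictly above the threshold. All function names are hypothetical.

```python
import math


def threshold_for_top_k(scores, k):
    """Return t such that at most k scores are strictly greater than t."""
    ranked = sorted(scores, reverse=True)
    if k >= len(ranked):
        return -math.inf  # budget covers every case
    return ranked[k]


def threshold_at_fpr(negative_scores, target_fpr):
    """Threshold that flags at most target_fpr of known-negative cases."""
    allowed = math.floor(target_fpr * len(negative_scores))
    return threshold_for_top_k(negative_scores, allowed)


def threshold_at_alert_cap(daily_scores, max_alerts_per_day):
    """Threshold that raises at most max_alerts_per_day alerts."""
    return threshold_for_top_k(daily_scores, max_alerts_per_day)


def recall_at(positive_scores, threshold):
    """Share of known-positive cases that score above the threshold."""
    return sum(s > threshold for s in positive_scores) / len(positive_scores)
```

Comparing `recall_at` for a candidate model and a baseline at the same budget gives the kind of relative uplift quoted in the examples.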

Audit Workflow

Recompute metrics using provided configs and seeds from the ZIP.
Check CI bands and confirm alignment with published values.
Verify SBOM, artifact hashes, and `manifest-hash.txt`; review `.aethergen/change-log.json`.
Record acceptance and attach evidence IDs to change-control.
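
The hash and CI-band checks in this workflow can be scripted. The following is a rough sketch, assuming `manifest.json` sits at the root of the ZIP and carries a `hashes` map of `sha256:<hex>` strings as in the manifest template; the function names are hypothetical, and signature verification is out of scope here.

```python
import hashlib
import json
import zipfile


def verify_zip_hashes(zip_path):
    """Return the files whose SHA-256 does not match manifest.json."""
    mismatches = []
    with zipfile.ZipFile(zip_path) as bundle:
        manifest = json.loads(bundle.read("manifest.json"))
        for name, expected in manifest["hashes"].items():
            try:
                data = bundle.read(name)
                actual = "sha256:" + hashlib.sha256(data).hexdigest()
            except KeyError:
                actual = None  # file listed in the manifest but absent
            if actual != expected:
                mismatches.append(name)
    return mismatches


def within_ci(recomputed, published, half_width):
    """True if a recomputed metric falls inside the published CI band."""
    return abs(recomputed - published) <= half_width
```

An empty list from `verify_zip_hashes` and a `True` from `within_ci` for each published metric would be the conditions to record before attaching evidence IDs to change-control.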

Procurement Questionnaire

Which operating points are supported and why?
What are segment stability results and limits?
How are drift incidents detected and rolled back?
What is the refresh cadence and SLA?

Template: Evidence Manifest

{
  "version": "2025.01",
  "artifacts": {
    "metrics": [
      "metrics/utility@op.json",
      "metrics/stability_by_segment.json"
    ],
    "plots": ["plots/roc_pr_curves.html"],
    "configs": ["configs/evaluation.yaml", "configs/thresholds.yaml"],
    "sbom": "sbom.json"
  },
  "hashes": {
    "metrics/utility@op.json": "sha256:abc123.",
    "metrics/stability_by_segment.json": "sha256:def456."
  },
  "seeds": "seeds/seeds.txt"
}
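
The template lists hashes for only two of the declared artifacts, presumably for brevity. Since `manifest.json` is meant to carry per-file hashes, a reviewer may want a coverage check. The sketch below assumes the key layout shown in the template; the helper names are hypothetical.

```python
def declared_paths(manifest):
    """Collect every artifact path the manifest declares, in order."""
    paths = []
    for value in manifest.get("artifacts", {}).values():
        # A group is either a single path or a list of paths.
        paths.extend([value] if isinstance(value, str) else value)
    if "seeds" in manifest:
        paths.append(manifest["seeds"])
    return paths


def unhashed_artifacts(manifest):
    """Return declared paths that have no entry under 'hashes'."""
    hashes = manifest.get("hashes", {})
    return [path for path in declared_paths(manifest) if path not in hashes]
```

Run against the template as written, this would report the plot, config, SBOM, and seed files as unhashed; a complete manifest should return an empty list.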

Appendix: Metric Definitions

Utility@OP: target KPI at specified threshold with CI.
Stability: max performance delta across declared segments.
Drift EW: early-warning index from input/outcome shifts.
Robustness: degradation under controlled corruptions.

Security & Privacy Notes

Evidence contains no PHI/PII; synthetic corpora can be documented separately.
DP budgets can be included via a toggle, publishing epsilon and synthetic ratio where applicable.
Access logs and signatures are recorded for all artifacts in the signed ZIP.

Closing

When evidence is part of the product, buyers don’t need persuasion; they need verification. AethergenPlatform turns every release into a verifiable unit of trust: signed, reproducible, and ready for audit.