Auspexi

Evidence in CI: Failing Closed and Passing Audits with Confidence

By Gwylym Owen — 45–60 min read

Executive Summary

Shipping AI in regulated environments demands more than accuracy. It requires evidence regenerated on every change, gates that fail closed, and artifacts that auditors can file. AethergenPlatform bakes evidence generation into CI so every release carries signed metrics, configurations, seeds, and hashes. If gates fail, promotion is blocked. If gates pass, procurement has everything they need.

Why Evidence in CI

Core Gates

Pipeline Overview

commit → build → evaluate → evidence → gates → package → publish
                       ↘ fail‑closed ↗

Operating Points (OPs)

Configuration

op:
  fpr: 0.01
stability:
  region_max_delta: 0.03
  product_max_delta: 0.02
latency:
  p95_ms: 120
privacy:
  membership_advantage_max: 0.05
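
As a minimal sketch, the thresholds above might be compared against measured values as follows. The `check_gates` helper, the flattened key names, and the `measured` dictionary are assumptions made for illustration; the platform's real schema is not shown in this article. The `op.fpr` entry is left out because it selects the operating point rather than acting as a limit.

```python
# Upper-bound thresholds copied from the configuration above.
THRESHOLDS = {
    "stability.region_max_delta": 0.03,
    "stability.product_max_delta": 0.02,
    "latency.p95_ms": 120,
    "privacy.membership_advantage_max": 0.05,
}


def check_gates(measured, thresholds=THRESHOLDS):
    """Return {gate_name: bool}; a missing measurement counts as a failure."""
    results = {}
    for name, limit in thresholds.items():
        value = measured.get(name)
        results[name] = value is not None and value <= limit
    return results


# Hypothetical measurements, reusing values quoted elsewhere in this article.
measured = {
    "stability.region_max_delta": 0.012,
    "stability.product_max_delta": 0.015,
    "latency.p95_ms": 97,
    "privacy.membership_advantage_max": 0.03,
}
results = check_gates(measured)
print(results)
```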

Evidence Artifacts

Manifest (Sketch)

{
  "version": "2025.01",
  "artifacts": [
    "metrics/utility@op.json",
    "metrics/stability_by_segment.json",
    "plots/roc_pr_curves.html",
    "configs/evaluation.yaml",
    "sbom.json"
  ],
  "hashes": {"metrics/utility@op.json": "sha256:..."},
  "seeds": "seeds/seeds.txt"
}
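
One way a manifest of this shape could be produced is sketched below. The helper names `sha256_of` and `build_manifest` are hypothetical, and the demo writes a single placeholder artifact into a temporary directory rather than real evaluation output.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def build_manifest(root, version, artifacts, seeds):
    """Assemble a manifest dict with one hash per listed artifact."""
    root = Path(root)
    return {
        "version": version,
        "artifacts": list(artifacts),
        "hashes": {rel: sha256_of(root / rel) for rel in artifacts},
        "seeds": seeds,
    }


# Demo on a throwaway directory with one placeholder artifact.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metrics" / "utility@op.json"
    target.parent.mkdir(parents=True)
    target.write_bytes(b'{"op": "fpr=0.01"}')
    manifest = build_manifest(
        tmp, "2025.01", ["metrics/utility@op.json"], "seeds/seeds.txt"
    )
    print(json.dumps(manifest, indent=2))
```

Hashing every listed artifact, rather than a subset, lets a reviewer detect any file that changed after the bundle was produced.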

Fail‑Closed Logic

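import sys

# Fail closed: promotion proceeds only if every gate explicitly passed.
# `gates` and `package_and_publish` are supplied by the pipeline.
for name in ("utility", "stability", "latency", "privacy"):
    if not getattr(gates, name).passed:  # `pass` is a reserved word in Python
        sys.exit(f"gate failed: {name}")
package_and_publish()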
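
The logic above can be tightened so that a gate which never reported a result also blocks promotion. The sketch below assumes gate results arrive as a plain dictionary; `REQUIRED_GATES` and `promotion_allowed` are hypothetical names.

```python
REQUIRED_GATES = ("utility", "stability", "latency", "privacy")


def promotion_allowed(gate_results):
    """Fail closed: only an explicit True for every required gate allows promotion.

    Missing entries, None, and any other non-True value count as failures.
    """
    failures = [g for g in REQUIRED_GATES if gate_results.get(g) is not True]
    return (not failures, failures)


# The privacy result is missing here, so promotion is blocked.
ok, failed = promotion_allowed(
    {"utility": True, "stability": True, "latency": True}
)
print(ok, failed)
```

Treating "unknown" as "failed" is the design choice that distinguishes fail-closed from fail-open: a crashed or skipped evaluation cannot slip through as a pass.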

Dashboards

Reproducibility

Example Utility@OP

{
  "op": "fpr=0.01",
  "global": {"metric": 0.758, "ci": [0.749, 0.767]},
  "segments": {
    "region": {"NA": 0.761, "EU": 0.753, "APAC": 0.749}
  }
}
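
The article does not define the utility metric, so the sketch below assumes it is the true-positive rate at the score threshold that holds the false-positive rate at or below 1%, with a percentile bootstrap for the confidence interval. The function names and the synthetic data are illustrative only.

```python
import random


def tpr_at_fpr(scores, labels, fpr=0.01):
    """TPR at the threshold that keeps the false-positive rate at or below `fpr`."""
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    allowed = int(fpr * len(neg))   # negatives permitted above the threshold
    threshold = neg[allowed]        # flag only scores strictly above this value
    return sum(s > threshold for s in pos) / len(pos)


def bootstrap_ci(scores, labels, fpr=0.01, n_boot=200, seed=0, alpha=0.05):
    """Percentile bootstrap CI, resampling positives and negatives separately."""
    rng = random.Random(seed)       # fixed seed so the interval is reproducible
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    stats = []
    for _ in range(n_boot):
        p = rng.choices(pos, k=len(pos))
        n = rng.choices(neg, k=len(neg))
        stats.append(tpr_at_fpr(p + n, [1] * len(p) + [0] * len(n), fpr))
    stats.sort()
    lo = stats[round((alpha / 2) * n_boot)]
    hi = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic scores: positives centred at 2.0, negatives at 0.0.
rng = random.Random(42)
labels = [1] * 500 + [0] * 2000
scores = [rng.gauss(2.0, 1.0) for _ in range(500)]
scores += [rng.gauss(0.0, 1.0) for _ in range(2000)]
point = tpr_at_fpr(scores, labels)
low, high = bootstrap_ci(scores, labels)
print({"op": "fpr=0.01", "global": {"metric": point, "ci": [low, high]}})
```

Recording the bootstrap seed alongside the result is what makes the interval regenerable on every change.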

Latency Report

{"p50": 42, "p95": 97, "p99": 142}
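
A report like the one above could be derived from raw latency samples roughly as follows. The nearest-rank percentile method is an assumption, since the article does not say which definition is used, and the sample data is made up.

```python
def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(samples)
    n = len(ordered)
    rank = max(1, -((-q * n) // 100))   # integer ceil(q * n / 100)
    return ordered[rank - 1]


def latency_report(samples_ms):
    """Summarise samples into the p50/p95/p99 shape used in the report."""
    return {f"p{q}": percentile(samples_ms, q) for q in (50, 95, 99)}


samples = list(range(1, 101))           # hypothetical latencies: 1 ms to 100 ms
report = latency_report(samples)
print(report)
print("latency gate passed:", report["p95"] <= 120)
```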

Privacy Probes

{"membership_advantage": 0.03, "ci": [0.01, 0.05], "threshold": 0.05}
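
The article does not define `membership_advantage`. A common definition, assumed in the sketch below, is the best true-positive rate minus false-positive rate achieved by an attacker who thresholds some per-example score (for instance model confidence) to guess whether an example was in the training set. The gate shown is a conservative variant that also checks the upper end of the confidence interval; whether the platform does this is not stated.

```python
def membership_advantage(member_scores, nonmember_scores):
    """Best TPR - FPR over all thresholds of a simple score-threshold attack."""
    best = 0.0
    for t in sorted(set(member_scores) | set(nonmember_scores)):
        tpr = sum(s >= t for s in member_scores) / len(member_scores)
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        best = max(best, tpr - fpr)
    return best


def privacy_gate(advantage, ci, threshold=0.05):
    """Pass only if the point estimate and the CI upper bound are within the threshold."""
    return advantage <= threshold and ci[1] <= threshold


# Toy scores where members are clearly separable (a leaky model).
leaky = membership_advantage([0.9, 0.8, 0.7, 0.4], [0.6, 0.5, 0.3, 0.2])
# Identical score distributions give the attacker no advantage.
safe = membership_advantage([1, 2], [1, 2])
print(leaky, safe, privacy_gate(0.03, [0.01, 0.05]))
```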

Ablations

Case Study: Healthcare Detector

Gates included OP utility (fpr=1%), region stability ≤0.03, and p95 ≤120 ms. An ablation that added claim‑graph motifs increased utility by 3.8% (CI: +3.0% to +4.6%). Privacy probes remained under their thresholds. The release passed, with evidence attached to the change and filed with procurement.

Case Study: Edge Vision Station

Station latency spiked in self‑tests, and the fail‑closed gate blocked promotion. A fallback model profile reduced latency, and the gates re‑ran green. The evidence recorded both attempts; procurement filed the passing manifest.

Runbooks

Incident

  1. Detect breach in staging/production; snapshot evidence.
  2. Roll back automatically; notify owners.
  3. Open incident; attach evidence; analyze deltas by segment.
  4. Patch; run shadow; promote after gates pass.

Promotion

  1. Build candidate; compute evidence.
  2. Verify gates; get QA sign‑off.
  3. Publish package; register in catalog; post release notes.

Templates

release.yaml
gates:
  utility@op: {min: 0.75}
  stability: {region_max_delta: 0.03}
  latency: {p95_ms: 120}
  privacy: {membership_advantage_max: 0.05}

Catalog Hooks

Procurement Alignment

Security

FAQ

Why fail‑closed?

It prevents risky promotions and replaces “it should be fine” with numbers.

Do we need all gates for every release?

Yes, for production promotion. Experimental branches can skip gates, but builds from them cannot be deployed.

How long do CI runs take?

We balance fidelity and timeliness; dashboards are cached; heavy jobs run in parallel.

Glossary

Checklists

Appendix: CI Pseudocode

stage evaluate:
  run utility@op
  run stability
  run latency
  run privacy

stage evidence:
  export plots
  write manifest.json
  sign artifacts

stage gates:
  if any fail → exit 1

stage package:
  tar models+configs+evidence
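
The stages above can be sketched as a small driver that stops at the first failure and reports a non-zero exit code. The `run_pipeline` function and the lambda stages are placeholders; real stages would run evaluations and write artifacts.

```python
import sys


def run_pipeline(stages):
    """Run (name, callable) stages in order; stop at the first failure.

    Each callable must return True on success. An exception is treated as a
    failure, so an unexpected error can never end in a published package.
    """
    for name, stage in stages:
        try:
            ok = stage() is True
        except Exception as exc:
            print(f"{name}: error: {exc}", file=sys.stderr)
            ok = False
        if not ok:
            return 1, name          # non-zero exit code and the failing stage
    return 0, None


# Hypothetical stages; the gates stage simulates a failing gate.
stages = [
    ("evaluate", lambda: True),
    ("evidence", lambda: True),
    ("gates", lambda: False),
    ("package", lambda: True),
]
code, failed_stage = run_pipeline(stages)
print(code, failed_stage)
```

A CI entry point would pass `code` to `sys.exit` so the surrounding job is marked failed and the package stage never runs.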

Appendix: Release Notes Template

Release: model‑x 2025.01
OP: fpr=1%
Utility: 0.758 [0.749,0.767]
Stability max deltas: region 0.012, product 0.015
Latency: p95 97ms
Privacy: membership_advantage 0.03 (≤0.05)
Manifest: 8e7...

Closing

Evidence in CI is how we ship trust, not just code. With AethergenPlatform, every release is an auditable unit: numbers, artifacts, and controls. Gates make safety routine; evidence makes adoption fast.

Narrative: Audit Day Walkthrough

The auditor receives the release folder. They open the HTML dashboards offline, check the manifest hashes, and confirm OP and stability bands. They review the SBOM and sign the acceptance form that references the bundle ID. No screen‑sharing, no chasing. Everything is in the box.
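
The hash check in this walkthrough needs nothing beyond a standard Python install. The sketch below assumes the manifest stores hashes as `sha256:<hex>` strings keyed by relative path, as in the manifest sketch earlier; `verify_manifest` is a hypothetical helper, and it deliberately flags any listed artifact that has no recorded hash.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_manifest(root, manifest):
    """Return the artifacts whose file or hash is missing, or whose hash differs."""
    problems = []
    for rel in manifest["artifacts"]:
        expected = manifest["hashes"].get(rel)
        path = Path(root) / rel
        if expected is None or not path.is_file():
            problems.append(rel)
            continue
        actual = "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            problems.append(rel)
    return problems


# Demo: a one-file bundle verified before and after simulated tampering.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sbom.json").write_bytes(b"{}")
    digest = "sha256:" + hashlib.sha256(b"{}").hexdigest()
    manifest = {"artifacts": ["sbom.json"], "hashes": {"sbom.json": digest}}
    clean = verify_manifest(tmp, manifest)
    (Path(tmp) / "sbom.json").write_bytes(b"{ }")
    tampered = verify_manifest(tmp, manifest)
print(clean, tampered)
```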

Acceptance Form (Template)

Acceptance: model‑x 2025.01
Bundle ID: 8e7...
OP: fpr=1% | Utility: 0.758 [0.749,0.767]
Stability: region<=0.03 | product<=0.02
Latency: p95=97ms | p99=142ms
Privacy: membership_advantage=0.03 (<=0.05)
SBOM: present | Manifest: present
Signatures: verified
Accepted by: ____________ Date: ______

Extended Gates Catalog

Environment Fingerprints

{
  "python": "3.11.6",
  "cuda": "12.1",
  "libraries": {
    "numpy": "1.26.4",
    "pandas": "2.2.1"
  }
}
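
Part of such a fingerprint can be captured with the standard library alone, as in this sketch. The `environment_fingerprint` name is hypothetical. The CUDA version is omitted because it is not available from the standard library; it would need to come from the GPU framework or driver tooling in use.

```python
import json
import platform
from importlib import metadata


def environment_fingerprint(packages):
    """Record the interpreter version and installed versions of the named packages."""
    libraries = {}
    for name in packages:
        try:
            libraries[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            libraries[name] = None      # not installed; recorded explicitly
    return {"python": platform.python_version(), "libraries": libraries}


fingerprint = environment_fingerprint(["numpy", "pandas"])
print(json.dumps(fingerprint, indent=2))
```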

Red‑Team Exercises

Rollout Strategies

CI/CD Example (YAML)

steps:
  - name: build
    run: make build
  - name: evaluate
    run: make evaluate
  - name: evidence
    run: make evidence
  - name: gates
    run: make gates
  - name: package
    run: make package

Dashboard Sections

Segment Table (Example)

segment, metric, ci_low, ci_high, delta
NA, 0.761, 0.752, 0.770, +0.003
EU, 0.753, 0.744, 0.762, -0.005
APAC, 0.749, 0.740, 0.758, -0.009
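
The numbers in this table appear consistent with two different deltas: the `delta` column matches each segment's metric minus the global value of 0.758, while the run log's "max delta region=0.012" matches the spread between the best and worst region. The sketch below recomputes both under that reading, which is an interpretation rather than something the article states.

```python
import csv
import io

TABLE = """segment,metric,ci_low,ci_high,delta
NA,0.761,0.752,0.770,+0.003
EU,0.753,0.744,0.762,-0.005
APAC,0.749,0.740,0.758,-0.009
"""

GLOBAL_METRIC = 0.758   # global value from the Utility@OP example


def max_pairwise_delta(rows):
    """Spread between the best and worst segment metric."""
    values = [float(r["metric"]) for r in rows]
    return round(max(values) - min(values), 6)


rows = list(csv.DictReader(io.StringIO(TABLE)))
deltas = {r["segment"]: round(float(r["metric"]) - GLOBAL_METRIC, 3) for r in rows}
spread = max_pairwise_delta(rows)
print(deltas, spread)
print("region stability gate passed:", spread <= 0.03)
```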

Explainers

Threshold Tables

thresholds:
  model_x:
    op: 0.73
    region_bands: 0.03
    product_bands: 0.02

Run Log (Excerpt)

2025-01-22 08:12Z evaluate utility@op ... OK
2025-01-22 08:13Z evaluate stability ... OK (max delta region=0.012)
2025-01-22 08:14Z evaluate latency ... OK (p95=97ms)
2025-01-22 08:15Z evaluate privacy ... OK (adv=0.03)
2025-01-22 08:16Z evidence bundle ... OK (manifest=8e7...)
2025-01-22 08:17Z gates ............ PASS
2025-01-22 08:18Z package .......... OK

Support & SLAs

Common Pitfalls

Legal Hooks

Extended FAQ

Can we waive a gate?

Only with explicit approval and compensating controls. The waiver is recorded with an expiry date.
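
A waiver record with an expiry might look like the sketch below. The `Waiver` fields, the helper name, and the example values are all hypothetical; the point is that an expired waiver stops covering a failed gate without anyone having to remember to revoke it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Waiver:
    gate: str
    approved_by: str
    compensating_control: str
    expires: date

    def is_active(self, today):
        """A waiver applies up to and including its expiry date."""
        return today <= self.expires


def gate_effective_pass(gate, passed, waivers, today):
    """A failed gate counts as passed only under an active waiver for that gate."""
    if passed:
        return True
    return any(w.gate == gate and w.is_active(today) for w in waivers)


waiver = Waiver(
    gate="latency",
    approved_by="hypothetical-approver",
    compensating_control="manual review of slow requests",
    expires=date(2025, 2, 28),
)
print(gate_effective_pass("latency", False, [waiver], date(2025, 2, 1)))
print(gate_effective_pass("latency", False, [waiver], date(2025, 3, 1)))
```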

What if privacy probes oscillate near thresholds?

Increase sample sizes and use moving averages with alarms; document decisions.
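
As a rough sketch of the moving-average idea, the function below smooths a series of probe readings over a fixed window and raises an alarm only when the smoothed value exceeds the threshold. The window size and the readings are made up for illustration.

```python
from collections import deque


def moving_average_alarms(values, window=3, threshold=0.05):
    """Return (moving_average, alarm) pairs, one per value once the window is full."""
    recent = deque(maxlen=window)
    out = []
    for v in values:
        recent.append(v)
        if len(recent) == window:
            avg = sum(recent) / window
            out.append((round(avg, 4), avg > threshold))
    return out


# Hypothetical membership-advantage readings from successive runs.
readings = [0.04, 0.06, 0.03, 0.04, 0.07, 0.08]
result = moving_average_alarms(readings)
print(result)
```

In this toy series the single reading of 0.06 does not trigger an alarm on its own, but the sustained rise at the end does. Smoothing of this kind suits monitoring and alerting; the per-release gate itself can still apply the raw threshold.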

Do we need robustness gates for text?

Often not, but drift and stability gates are still mandatory.

Appendix: CSV/JSON Schemas

utility@op.csv: segment:string,metric:float,ci_low:float,ci_high:float,delta:float
latency.json: {"p50":int,"p95":int,"p99":int}

Appendix: Checklist (One‑Pager)

[ ] OP utility with CIs
[ ] Stability bands
[ ] Latency SLOs
[ ] Privacy probes
[ ] SBOM and signatures
[ ] Manifest hashes
[ ] Release notes with bundle ID

Closing Notes (Extended)

Evidence turns “trust us” into “prove it.” With AethergenPlatform, CI makes proof routine: repeatable, signed, and ready for audit or procurement—no heroics required.