Auspexi

The Synthetic Data Lifecycle: From Seeds to Evidence

By Gwylym Owen — 18–24 min read

Executive Summary

AethergenPlatform supports schema design, generation, validation, and evidence packaging. As of September 2025, this enables delivery without PHI/PII, backed by artifacts that buyers can evaluate and adopt.

Lifecycle Stages

The process:

  1. Schema: Define entities, relations, vocabularies, and constraints—set the stage!
  2. Seeds: Minimal/redacted aggregates to learn structure—plant the seeds!
  3. Generation: Craft realistic corpora with parameterised scenarios—create the illusion!
  4. Validation: Fidelity/utility metrics with CIs—validate results.
  5. Privacy: Probes and optional DP budgets—keep it secret, keep it safe!
  6. Packaging: Parquet/Delta, notebooks, and samples—wrap it up!
  7. Evidence: Signed bundle with metrics/configs/seeds/hashes—seal the deal!

Design Patterns

Practical patterns:

KPIs

Metrics to monitor:

Evidence Integration

Operational integration:

Case Study

Scenario: An insurer’s staged rollout.

An insurer could use the lifecycle to ship a claims corpus plus detectors. Evidence might show stability across regions and include a rollback SOP. In a simulated two-week cycle (as of September 2025), procurement could then sign off.

FAQ

Can we skip seeds?

Seeds (or aggregates) anchor realism, so skipping them entirely costs fidelity. If you start from public priors instead, the fit can be tightened later as policy allows.

What about rare classes?

Scenario overlays and targeted augmentations help preserve tails; limits are disclosed in evidence.

Can we export dashboards?

Yes—HTML/PDF artifacts are bundled for offline review.

How do we handle regulator audits?

We deliver reproducible artifacts and documented limits; sensitive data stays on your infrastructure.

Can we attach private annexes for regulators?

Yes—annexes can ship with independent manifests; public bundles link without exposing sensitive content.

Glossary

Checklist

Cast it right:

Recipe Manifest

recipe:
  schema: schemas/claims_v3.yaml
  generator: copula+sequence
  scenarios:
    - upcoding: {prevalence: 0.03, factor: 1.2}
    - duplicate_billing: {delay_days: 7}
  outputs: parquet
  

Validation Dashboard Contents

Peek inside the crystal ball:

Privacy Probes

Guard the secrets:

CI Example

steps:
  - generate_small
  - validate
  - run_probes
  - evidence_bundle
artifacts: [parquet, metrics.json, plots.html, manifest.json]
  

Evidence Manifest

{
  "version": "2025.01",
  "artifacts": ["metrics.json", "plots.html", "sbom.json"],
  "hashes": {"metrics.json": "..."}
}
  

Runbook

Checklist:

  1. Change detected → regenerate evidence—cast anew!
  2. Compare against gates; if fail, fix or revert.
  3. Attach bundle to change-control; notify stakeholders.

Risks & Mitigations

Defend the realm:

Procurement Checklist

Seal the deal:

Schema Designer

Craft the foundation:

Seeds Policy

Guard the roots:

Generation Deep Dive

Unleash the creativity:

Validation Details

Test the potion:

Privacy Details

Security:

Packaging Details

Packaging:

Evidence Details

Seal the scroll:

Case Studies

More examples:

Templates

acceptance:
  bundle_id: string
  op: string
  stability: string
  privacy: string
  latency: string
  decision: APPROVE|REJECT
  

Detailed Schema Catalog

entities:
  Patient: {id: string, age: int, region: enum[NA,EU,APAC]}
  Provider: {id: string, specialty: enum, region: enum}
  Facility: {id: string, type: enum, region: enum}
  Claim: {id: string, patient_id: ref Patient.id, provider_id: ref Provider.id,
          facility_id: ref Facility.id, date: date, pos: enum, amount: decimal}
  LineItem: {id: string, claim_id: ref Claim.id, cpt: string, icd10: string, units: int}
relations:
  Patient 1..* Claim
  Claim 1..* LineItem
constraints:
  Claim.amount >= 0
  LineItem.units > 0
vocabularies:
  CPT_v12: {...}
  ICD10_subset: {...}
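
The constraints and references in the catalog above can be checked mechanically before any generation run. The following is a minimal sketch, not the platform's validator: the `Claim` and `LineItem` dataclasses mirror the catalog fields, while `check_constraints` and the sample records are hypothetical names and values introduced for illustration.

```python
import datetime
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Claim:
    id: str
    patient_id: str
    provider_id: str
    facility_id: str
    date: datetime.date
    pos: str
    amount: Decimal


@dataclass
class LineItem:
    id: str
    claim_id: str
    cpt: str
    icd10: str
    units: int


def check_constraints(claims, line_items, patient_ids):
    """Return a list of human-readable violations (empty means all checks passed)."""
    violations = []
    claim_ids = {c.id for c in claims}
    for c in claims:
        if c.amount < 0:  # constraint: Claim.amount >= 0
            violations.append(f"Claim {c.id}: negative amount {c.amount}")
        if c.patient_id not in patient_ids:  # reference: Claim.patient_id -> Patient.id
            violations.append(f"Claim {c.id}: unknown patient {c.patient_id}")
    for li in line_items:
        if li.units <= 0:  # constraint: LineItem.units > 0
            violations.append(f"LineItem {li.id}: non-positive units {li.units}")
        if li.claim_id not in claim_ids:  # reference: LineItem.claim_id -> Claim.id
            violations.append(f"LineItem {li.id}: unknown claim {li.claim_id}")
    return violations


# Illustrative records only; the second claim and second line item are deliberately invalid.
claims = [
    Claim("c1", "p1", "pr1", "f1", datetime.date(2025, 1, 5), "11", Decimal("120.50")),
    Claim("c2", "p9", "pr1", "f1", datetime.date(2025, 1, 6), "11", Decimal("-5.00")),
]
line_items = [
    LineItem("l1", "c1", "99213", "E11.9", 1),
    LineItem("l2", "c3", "99213", "E11.9", 0),
]
violations = check_constraints(claims, line_items, patient_ids={"p1"})
for v in violations:
    print(v)
```

Returning a list of violations rather than raising on the first one makes it easier to attach the full result to an evidence bundle.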
  

Reference Constraints

Keep it grounded:

Entity–Relationship Examples

Patient(id) ──< Claim(id) ──< LineItem(id)
   │            │
   └── region   ├── provider_id ──> Provider(id)
                └── facility_id ──> Facility(id)
  

Seeds Governance Checklist

Protect the source:

Generation Parameter Tables

param, default, min, max, note
amount.ln_mu, 4.1, 3.8, 4.6, mean of ln(amount) (log-normal location)
amount.ln_sigma, 0.7, 0.5, 0.9, std dev of ln(amount) (tail width)
interarrival.lambda1, 0.3, 0.1, 0.6, short gap component
interarrival.lambda2, 0.8, 0.4, 1.2, long gap component
mix.weight, 0.4, 0.2, 0.6, mixture proportion
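
To make the amount parameters concrete, here is a small sketch that samples claim amounts from a log-normal distribution with bounds checking. It assumes `amount.ln_mu` and `amount.ln_sigma` are the mean and standard deviation of ln(amount), which is how Python's `random.lognormvariate` interprets its arguments. The `PARAMS`, `resolve`, and `sample_amounts` names are hypothetical, and the interarrival mixture is left out because the text does not say how the lambdas are parameterised.

```python
import math
import random
import statistics

# Defaults and bounds copied from the parameter table.
PARAMS = {
    "amount.ln_mu": {"default": 4.1, "min": 3.8, "max": 4.6},
    "amount.ln_sigma": {"default": 0.7, "min": 0.5, "max": 0.9},
}


def resolve(name, override=None):
    """Return the default, or the override if it lies within [min, max]; raise otherwise."""
    spec = PARAMS[name]
    if override is None:
        return spec["default"]
    if not spec["min"] <= override <= spec["max"]:
        raise ValueError(f"{name}={override} outside [{spec['min']}, {spec['max']}]")
    return override


def sample_amounts(n, ln_mu=None, ln_sigma=None, seed=0):
    """Draw n amounts from a log-normal distribution using a seeded generator."""
    rng = random.Random(seed)
    mu = resolve("amount.ln_mu", ln_mu)
    sigma = resolve("amount.ln_sigma", ln_sigma)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


amounts = sample_amounts(20_000)
# For a log-normal distribution the median equals exp(mu), which gives a quick sanity check.
print(f"sample median: {statistics.median(amounts):.1f}")
print(f"exp(ln_mu):    {math.exp(4.1):.1f}")
```

Rejecting out-of-range overrides, rather than silently clamping them, keeps the recorded configuration identical to what was actually run.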
  

Overlay Library

overlays:
  upcoding: {prevalence: 0.03, factor: 1.2}
  unbundling: {prevalence: 0.01}
  phantom_provider: {distance_km: ">150", time_collision: true}
  duplicate_billing: {delay_days: 7}
  doctor_shopping: {window_days: 14, device_reuse: 0.25}
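
As an illustration of how one overlay might be applied, the sketch below injects `upcoding` into a list of claims. It assumes that `prevalence` is the fraction of claims affected and that `factor` multiplies the billed amount; the text does not define either term, so treat both readings, and the `apply_upcoding` and `is_upcoded` names, as hypothetical.

```python
import random


def apply_upcoding(claims, prevalence=0.03, factor=1.2, seed=0):
    """Return copies of the claims with a labelled upcoding overlay applied."""
    rng = random.Random(seed)
    out = []
    for claim in claims:
        hit = rng.random() < prevalence  # each claim is affected independently
        updated = dict(claim)            # copy so the input corpus stays untouched
        if hit:
            updated["amount"] = round(claim["amount"] * factor, 2)
        updated["is_upcoded"] = hit      # ground-truth label for detector evaluation
        out.append(updated)
    return out


# Illustrative corpus: 10,000 claims with a flat amount.
claims = [{"id": f"c{i}", "amount": 100.0} for i in range(10_000)]
overlaid = apply_upcoding(claims)
rate = sum(c["is_upcoded"] for c in overlaid) / len(overlaid)
print(f"observed upcoding rate: {rate:.3f}")
```

Keeping the ground-truth label next to each record is what lets a detector's precision and recall be measured later without any real fraud labels.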
  

Overlay Composition Rules

Blend with care:

Validation Worksheets

field, ks_pvalue, pass
amount, 0.21, yes
units, 0.34, yes
pos, 0.08, borderline (flag)
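
The `ks_pvalue` column could be produced by a two-sample Kolmogorov-Smirnov test comparing a synthetic field against its reference distribution. The sketch below is a standard-library version that uses the usual asymptotic approximation for the p-value. The cut-offs in `verdict` are assumptions, chosen only so that they agree with the three rows of the worksheet; the text does not state the actual gates.

```python
import bisect
import math
import random


def ks_two_sample(a, b):
    """Return (D, p) for the two-sample KS test, with an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest gap between the two empirical CDFs over all observed values.
    d = max(
        abs(bisect.bisect_right(a, x) / n - bisect.bisect_right(b, x) / m)
        for x in a + b
    )
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:  # the series converges slowly here and the p-value is ~1 anyway
        return d, 1.0
    # Kolmogorov series: Q(lam) = 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 * j^2 * lam^2)
    p = 2 * sum((-1) ** (j - 1) * math.exp(-2 * (j * lam) ** 2) for j in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def verdict(p, fail_below=0.05, flag_below=0.10):
    """Map a p-value to a worksheet entry; both thresholds are hypothetical."""
    if p < fail_below:
        return "no"
    if p < flag_below:
        return "borderline (flag)"
    return "yes"


rng = random.Random(1)
reference = [rng.gauss(0, 1) for _ in range(500)]
shifted = [rng.gauss(1, 1) for _ in range(500)]  # deliberately different distribution
d, p = ks_two_sample(reference, shifted)
print(f"D={d:.3f}, verdict={verdict(p)}")
```

The KS test assumes continuous data, so for discrete fields such as `units` or `pos` the p-value is only approximate, and a categorical test may be a better fit.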
  

Operating Point Selection

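Given an alert budget of B alerts/day and a volume of V items/day, choose the threshold θ such that
the alert rate at θ is approximately B / V. When true positives are rare, this is close to FPR(θ) ≈ B / V. Validate precision/recall at θ with CIs.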
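
One way to turn this rule into code is to rank the scores and take the value at the position the budget allows. The sketch below does that and then reports precision and recall at the chosen θ. It matches the overall alert rate to B / V, which approximates the FPR when positives are rare. The function names, the 2% positive rate, and the beta-distributed scores are all illustrative assumptions.

```python
import random


def choose_threshold(scores, budget_per_day, volume_per_day):
    """Pick theta so that roughly budget/volume of the scored items raise an alert."""
    alert_rate = budget_per_day / volume_per_day
    ranked = sorted(scores, reverse=True)
    # Number of items allowed to alert, kept within [1, len(ranked)].
    k = min(len(ranked), max(1, round(alert_rate * len(ranked))))
    return ranked[k - 1]  # alert when score >= theta


def precision_recall(scores, labels, theta):
    """Compute precision and recall for the rule 'alert when score >= theta'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= theta and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= theta and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < theta and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Synthetic validation set: positives tend to score higher than negatives.
rng = random.Random(0)
labels = [rng.random() < 0.02 for _ in range(20_000)]
scores = [rng.betavariate(5, 2) if y else rng.betavariate(2, 5) for y in labels]

theta = choose_threshold(scores, budget_per_day=500, volume_per_day=50_000)
alerts = sum(s >= theta for s in scores)
precision, recall = precision_recall(scores, labels, theta)
print(f"theta={theta:.3f} alerts={alerts} precision={precision:.2f} recall={recall:.2f}")
```

Confidence intervals for precision and recall can be obtained by resampling the validation set and repeating the calculation at the fixed θ.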
  

Effect Size Computation

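# evaluate, tweak, bootstrap_ci, and record are placeholders for project helpers.
base = evaluate(cfg_base)                  # baseline run
for factor in factors:
  cfg = tweak(cfg_base, factor)            # change one factor at a time
  result = evaluate(cfg)
  delta = result.kpi_op - base.kpi_op      # effect size at the operating point
  ci = bootstrap_ci(base, result)          # CI for the same kpi_op difference
  record(factor, delta, ci)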
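
The `bootstrap_ci` helper in the loop above is not defined in the text. One plausible shape, assuming each evaluation yields a per-case outcome (for example 1 for a hit and 0 for a miss) on the same cases, is a paired percentile bootstrap on the difference in means. The signature and the sample data below are hypothetical.

```python
import random
import statistics


def bootstrap_ci(base, variant, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(variant) - mean(base) on paired per-case values."""
    if len(base) != len(variant):
        raise ValueError("base and variant must cover the same cases")
    rng = random.Random(seed)
    diffs = [v - b for b, v in zip(base, variant)]
    boot = []
    for _ in range(n_boot):
        resample = rng.choices(diffs, k=len(diffs))  # resample cases with replacement
        boot.append(statistics.fmean(resample))
    boot.sort()
    lo = boot[int((alpha / 2) * n_boot)]
    hi = boot[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Illustrative outcomes: the variant turns 60 of 1,000 misses into hits.
base = [1] * 700 + [0] * 300
variant = [1] * 760 + [0] * 240
delta = statistics.fmean(variant) - statistics.fmean(base)
lo, hi = bootstrap_ci(base, variant)
print(f"delta={delta:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Resampling the paired differences, rather than each run independently, removes case-to-case variation that both configurations share and usually gives a tighter interval.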
  

Drift Monitors

monitors:
  input_psi:
    fields: [amount, pos]
    threshold: 0.2
  outcome_delta:
    by_segment: [region, product]
    threshold: 0.05
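
The `input_psi` monitor refers to the Population Stability Index, computed per bin as (actual - expected) * ln(actual / expected) and summed. Here is a minimal sketch for categorical or pre-binned values. The `bin_amount` edges and the sample `pos` distributions are made up for illustration; only the 0.2 threshold comes from the config above.

```python
import math
from collections import Counter


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two samples of categorical or binned values."""
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in set(expected) | set(actual):
        e = max(e_counts[category] / len(expected), eps)  # floor avoids log(0)
        a = max(a_counts[category] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total


def bin_amount(x, edges=(50, 100, 250, 1000)):
    """Map a numeric amount to a bin index using hypothetical fixed edges."""
    return sum(x >= edge for edge in edges)


# Illustrative place-of-service mix: baseline vs. current window.
baseline_pos = ["11"] * 700 + ["21"] * 200 + ["23"] * 100
current_pos = ["11"] * 500 + ["21"] * 300 + ["23"] * 200

score = psi(baseline_pos, current_pos)
print(f"psi(pos)={score:.3f}, alert={score > 0.2}")
```

Numeric fields such as `amount` need binning first, for example `psi([bin_amount(x) for x in old], [bin_amount(x) for x in new])`. Note that the choice of bin edges affects the score.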
  

Privacy Methodology

Guard the vault:

Packaging Artifacts Catalog

Deliver the goods:

Evidence Manifests

{
  "version": "2025.01",
  "artifacts": ["metrics/utility@op.json", "plots/stability_bars.html"],
  "hashes": {"metrics/utility@op.json": "sha256:..."},
  "env": {"python": "3.11", "numpy": "1.26.4"}
}
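
A reviewer can recompute the `hashes` entries to confirm that a bundle has not changed since it was produced. The sketch below builds and verifies a manifest of this shape with SHA-256. It covers hashing only, since the signing scheme is not described in the text. The helper names and the placeholder metrics file are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return 'sha256:<hex>' for a file, reading in chunks to handle large artifacts."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def build_manifest(root, artifacts, version):
    """Build a manifest dict with one hash per artifact path (relative to root)."""
    return {
        "version": version,
        "artifacts": list(artifacts),
        "hashes": {name: sha256_of(Path(root) / name) for name in artifacts},
    }


def verify_manifest(root, manifest):
    """Return the artifacts that are missing or whose hash differs from the manifest."""
    bad = []
    for name, expected in manifest["hashes"].items():
        path = Path(root) / name
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(name)
    return bad


with tempfile.TemporaryDirectory() as root:
    artifact = Path(root) / "metrics" / "utility@op.json"
    artifact.parent.mkdir(parents=True)
    artifact.write_text(json.dumps({"placeholder": True}))  # stand-in content
    manifest = build_manifest(root, ["metrics/utility@op.json"], "2025.01")
    ok_before = verify_manifest(root, manifest)
    artifact.write_text(json.dumps({"placeholder": False}))  # simulate tampering
    bad_after = verify_manifest(root, manifest)

print("mismatches before edit:", ok_before)
print("mismatches after edit: ", bad_after)
```

Hashing detects accidental or malicious changes to artifacts, but only a signature over the manifest itself ties the bundle to its producer.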
  

Unity Catalog Comments

COMMENT ON TABLE prod.ai.claims IS 'Purpose: fraud triage; OP: fpr=1%; Evidence: manifest 2025.01.';
  

Buyer Notebook

# 1) Load sample table
# 2) Run UDF at OP threshold
# 3) Compute OP metrics with CIs
# 4) Review stability summary
  

Audit File Tree

release_2025_01/
├─ metrics/
├─ plots/
├─ configs/
├─ privacy/
├─ sbom.json
├─ manifest.json
└─ README.html
  

Risk Register

risk, likelihood, impact, control, owner
tail_undercoverage, med, med, overlays+limits, data_lead
probe_regression, low, high, gates+waiver_policy, privacy_lead
  

SLA Mapping

Operational expectations:

Closing

Schema, seeds, generation, overlays, validation, privacy, packaging, and evidence: together, these stages form a repeatable process. AethergenPlatform keeps the trail auditable, so you can progress from pilot to production with confidence.