Auspexi

Synthetic Data 101 for Healthcare: Fraud Detection Without PHI/PII

By Gwylym Owen — 20–28 min read

Executive Summary

Healthcare fraud programs often stall when privacy rules block data sharing, testing, and verification. AethergenPlatform supports a synthetic-first pipeline that captures claim behavior without exposing identity. Teams can prototype, evaluate, and produce procurement-grade evidence with corpora that mirror real claims yet contain no PHI/PII. This piece (current as of September 2025) walks through the data model, generation pipeline, typology library, evaluation gates, and evidence packaging needed to move from pilot to production in regulated zones.

Problem Context

Fraud typologies evolve faster than change‑control cycles. Rules become stale, models drift, and analyst workload increases. Real data cannot leave secure zones, so collaboration slows, tools remain untested, and vendor claims are difficult to verify. Synthetic corpora enable safe iteration, repeatable evaluation, and procurement‑grade evidence.

Clinical-Claims Data Model

Here’s the blueprint, simplified:

Dataset Schema Example

claims(
  claim_id, patient_id*, provider_id*, facility_id*, date, pos,
  cpt, icd10, modifiers, units, amount_billed, amount_allowed, amount_paid,
  payer_plan, referral_flag, prior_auth_flag
)

lines(
  line_id, claim_id, cpt, icd10, modifiers, units, npi*, specialty,
  amount_billed, amount_allowed, amount_paid
)

rx(
  rx_id, patient_id*, provider_id*, date, drug_class, dose, days_supply,
  payment_amount, device_id*
)

labs(
  lab_id, patient_id*, loinc_code, result_band, units, date
)

*Synthetic identifiers only; no PHI/PII in evaluation corpora.
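
To make the schema concrete, here is a minimal Python sketch that builds one synthetic claim record with the fields above. It is illustrative only, not the platform's generator; the identifier format, code values, and amount ranges are assumptions.

# Minimal sketch (not the platform's generator): one synthetic claim record
# matching the schema above. All identifiers are generated surrogates; the id
# format, codes, and amount ranges are illustrative assumptions.
import random
import uuid
from datetime import date, timedelta

random.seed(7)  # fixes dates and amounts for the example; uuid ids still vary

def synth_id(prefix: str) -> str:
    # Surrogate identifier with no link to any real person or provider.
    return f"{prefix}-{uuid.uuid4().hex[:12]}"

claim = {
    "claim_id": synth_id("CLM"),
    "patient_id": synth_id("PAT"),   # synthetic only, per the footnote above
    "provider_id": synth_id("PRV"),
    "facility_id": synth_id("FAC"),
    "date": (date(2025, 1, 1) + timedelta(days=random.randrange(180))).isoformat(),
    "pos": "11",          # place-of-service code (office visit)
    "cpt": "99214",       # example E/M code
    "icd10": "M54.50",    # example diagnosis code
    "modifiers": [],
    "units": 1,
    "amount_billed": round(random.uniform(120, 400), 2),
    "payer_plan": "PLAN-A",
    "referral_flag": False,
    "prior_auth_flag": False,
}
claim["amount_allowed"] = round(claim["amount_billed"] * 0.7, 2)
claim["amount_paid"] = round(claim["amount_allowed"] * 0.9, 2)
print(claim)

Lines, rx, and labs rows would be built the same way and keyed by the same surrogate identifiers, so episodes stay linkable without touching any real identity.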
  

Metrics Appendix

Measure what matters:

Acceptance Checklist

Checklist:

Use Case Example

Scenario: A simulated regional upcoding sweep for a payer.

A four-week pilot on orthopedic claims could lock operating points at a 1% false-positive rate (FPR). A synthetic-trained baseline might lift reviewed-case yield over legacy rules while cutting false escalations. Stability would be checked across regions, and privacy probes would be monitored throughout. Procurement proceeds only if the gates pass and a rollback path is defined.

FAQ

Does synthetic data replace real investigations?

No. Synthetic data is for testing and development; real investigations confirm outcomes.

Can we tune prevalence to stress analysts?

Yes. Typologies are adjustable to simulate workload and policy trade‑offs.

How do you prevent overfitting to synthetic quirks?

We run ablations so that robust features win, sanity-check rankings on approved internal samples, and document known limits in the evidence bundle.
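
As a toy illustration of that ablation step, the sketch below retrains the same model with a suspect feature family dropped and compares ranking quality. It is not our production harness; the features, model, and data are placeholders.

# Ablation sketch (illustrative, not the production harness): train the same
# model with and without one feature family and compare ranking quality.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
# Toy feature matrix: columns 0-2 stand in for behavioral features; column 3
# stands in for a synthetic-corpus quirk that should not drive the model.
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def auc_with(cols):
    model = GradientBoostingClassifier(random_state=0).fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te[:, cols])[:, 1])

full = auc_with([0, 1, 2, 3])
ablated = auc_with([0, 1, 2])  # drop the suspect feature family
print(f"AUC with all features: {full:.3f}; without suspect family: {ablated:.3f}")
# A large drop would suggest the model leans on a corpus artifact rather than
# robust behavioral signal.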

Glossary

Procurement Q&A

Contact

Ready to tackle fraud detection safely and effectively? Contact us about a focused pilot.

Generation Pipeline

Let’s build it step by step:

  1. Schema Design: Pick fields and ranges for fraud utility; encode CPT families, ICD hierarchies—set the stage!
  2. Distribution Learning: Learn marginals and joint structure from seeds/redacted aggregates; fit copulas or conditionals (see the sketch after this list)—keep dependencies alive!
  3. Sequence Synthesis: Craft episode timelines (admissions→procedures→discharge; refills) with realistic timing and seasonality—tell the story!
  4. Typology Injection: Add parameterized fraud behaviors (below) with tunable prevalence and severity—mix it up!
  5. Validation: Run fidelity and privacy checks; tweak until thresholds pass—make it solid!
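
For step 2, one minimal sketch (assuming a Gaussian copula over two claim fields; this is not the platform's implementation, and the seed data here is simulated) looks like this:

# Gaussian-copula sketch for step 2: learn marginals plus dependence from a
# seed sample, then draw new values that preserve both.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in "seed" data: billed amount and units with positive dependence.
seed_billed = np.exp(rng.normal(5.0, 0.4, size=2000))
seed_units = np.clip(np.round(seed_billed / 150 + rng.normal(0, 0.7, 2000)), 1, 10)
seed = np.column_stack([seed_billed, seed_units])

# 1. Map each marginal to normal scores via ranks (probability integral transform).
u = (stats.rankdata(seed, axis=0) - 0.5) / len(seed)
z = stats.norm.ppf(u)

# 2. Estimate the copula correlation and sample new normal scores.
corr = np.corrcoef(z, rowvar=False)
z_new = rng.multivariate_normal(mean=[0.0, 0.0], cov=corr, size=5000)

# 3. Map back through each empirical marginal to get synthetic values.
u_new = stats.norm.cdf(z_new)
synthetic = np.column_stack(
    [np.quantile(seed[:, j], u_new[:, j]) for j in range(seed.shape[1])]
)
print("seed corr:", round(float(np.corrcoef(seed, rowvar=False)[0, 1]), 3),
      "synthetic corr:", round(float(np.corrcoef(synthetic, rowvar=False)[0, 1]), 3))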

Fidelity: What “Good Enough” Means

Here’s the standard:

Privacy: What We Measure

No assumptions here:

Fraud Typology Library

Tune these fraud flavors:

Feature Families

Build the signals:

Modeling and Thresholds

Keep it clear and strong:

Use transparent baselines (rules, tree ensembles) for interpretability, adding deep models only where they deliver measurable lift. Pick operating points that match investigator capacity (alerts per day per team) and share the trade-off. Pilots often target “+X% confirmed cases at a fixed FPR” rather than raw AUC—practical wins!
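
The operating-point arithmetic is simple enough to sketch. The score distributions, claim volume, and fraud prevalence below are assumptions chosen for illustration, not benchmarks.

# Pick a score threshold for a target false-positive rate, then translate it
# into an expected alert volume per day (all numbers are illustrative).
import numpy as np

rng = np.random.default_rng(2)
scores_legit = rng.beta(2, 8, size=20000)   # held-out scores, legitimate claims
scores_fraud = rng.beta(6, 3, size=400)     # held-out scores, injected fraud

target_fpr = 0.01
threshold = np.quantile(scores_legit, 1 - target_fpr)  # 1% of legit exceed it
recall = float(np.mean(scores_fraud >= threshold))     # fraud caught at that point

claims_per_day = 15000       # assumed daily claim volume
fraud_prevalence = 0.02      # assumed fraud prevalence
# Expected alerts = legitimate claims passing the threshold + fraud passing it.
alerts_per_day = claims_per_day * (
    (1 - fraud_prevalence) * target_fpr + fraud_prevalence * recall
)
print(f"threshold={threshold:.3f}  recall at 1% FPR={recall:.2%}  "
      f"expected alerts/day={alerts_per_day:.0f}")

If the expected alert volume exceeds what the investigation team can review, the threshold (and the detection rate it buys) has to move, which is exactly the trade-off to share.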

Evaluation Gates

Procurement-ready checks:

Evidence Bundle

Shipped with every release:

Pilot → Policy in Four Weeks

Your roadmap:

  1. Week 1: Pick a typology and KPI; lock schema; generate v1 corpora—kick it off!
  2. Week 2: Train baselines; tune operating points; check fidelity/privacy—refine it!
  3. Week 3: Red-team failures; set gates; package evidence—test the edges!
  4. Week 4: Run acceptance with stakeholders; sign off thresholds and rollback—seal the deal!

Integration

Plug it in:

AethergenPlatform prioritizes evidence. You get realism without identifiers, utility at your alert budget, and reproducible packages suitable for audit.

Contact Sales →

Worked Example: Upcoding Review

Let’s break it down:

  1. Setup: Define CPT family and specialty constraints; set cost bands—lay the ground!
  2. Generate: Create cohort with tunable upcoding prevalence and factor (see the sketch after this list)—mix it up!
  3. Train: Baseline trees + rules; calibrate to 1% FPR—tune it tight!
  4. Report: Cases per analyst-hour with explanations and feature contributions—show the win!
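
The generate step could look like the sketch below. The billed-amount distribution, prevalence, and upcoding factor are assumptions for illustration, not platform defaults.

# Inject upcoding into a synthetic cohort at a tunable prevalence and factor.
import numpy as np

rng = np.random.default_rng(3)

def make_cohort(n_claims=10000, prevalence=0.05, upcode_factor=1.6):
    # Baseline billed amounts for one CPT family (lognormal; values illustrative).
    billed = np.exp(rng.normal(5.2, 0.35, size=n_claims))
    labels = rng.random(n_claims) < prevalence   # which claims are upcoded
    billed[labels] *= upcode_factor              # upcoded claims bill higher
    return billed, labels.astype(int)

billed, labels = make_cohort(prevalence=0.08, upcode_factor=1.8)
print("fraud rate:", round(float(labels.mean()), 3),
      "median billed (legit vs upcoded):",
      round(float(np.median(billed[labels == 0])), 2),
      round(float(np.median(billed[labels == 1])), 2))

Sweeping prevalence and upcode_factor is how the same corpus can stress both the model and the analyst workload assumptions.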

Worked Example: Doctor Shopping

Another go:

  1. Emit: Overlapping prescription sequences across providers and pharmacies—build the case!
  2. Construct: Device/address correlations and cadence features (see the sketch after this list)—connect the dots!
  3. Compare: Rules vs learned models; tweak escalation policies—find the best!
  4. Publish: Stability across regions and plan types—prove it holds!
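
The construct step might build features like these, sketched here with pandas over the rx schema from earlier. The tiny hand-written table is an illustrative stand-in for a synthetic corpus.

# Two doctor-shopping signals: distinct prescribers per patient per month and
# the median gap (days) between consecutive fills. Field names follow the rx
# schema above; the rows are illustrative.
import pandas as pd

rx = pd.DataFrame({
    "patient_id": ["PAT-1"] * 5 + ["PAT-2"] * 3,
    "provider_id": ["PRV-1", "PRV-2", "PRV-3", "PRV-4", "PRV-2",
                    "PRV-9", "PRV-9", "PRV-9"],
    "date": pd.to_datetime([
        "2025-01-02", "2025-01-08", "2025-01-15", "2025-01-20", "2025-01-28",
        "2025-01-05", "2025-02-04", "2025-03-06",
    ]),
}).sort_values(["patient_id", "date"])

# Peak number of distinct prescribers seen by a patient in any calendar month.
prescribers_per_month = (
    rx.groupby(["patient_id", pd.Grouper(key="date", freq="MS")])["provider_id"]
      .nunique()
      .groupby(level="patient_id")
      .max()
)

# Cadence feature: median gap in days between consecutive fills per patient.
median_gap_days = rx.groupby("patient_id")["date"].apply(
    lambda s: s.diff().dt.days.median()
)

print(pd.DataFrame({"prescribers_per_month": prescribers_per_month,
                    "median_gap_days": median_gap_days}))

PAT-1, who sees four prescribers in a single month with short gaps between fills, is the pattern these features are meant to surface.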

Analyst Experience

Make it easy:

Limits and Non-Goals

Know the boundaries:

Next Steps

Let’s get moving: