Synthetic Data 101 for Healthcare: Fraud Detection Without PHI/PII
By Gwylym Owen — 20–28 min read
Executive Summary
Healthcare fraud programs often stall when privacy rules block sharing, testing, and verification. AethergenPlatform supports a synthetic‑first pipeline that captures behavior without exposing identity. Teams can prototype, evaluate, and provide evidence with corpora that mirror real claims—PHI/PII‑free. This piece walks through the data model, generation process, typology library, evaluation gates, and evidence packaging to move from pilot to production in regulated zones as of September 2025.
Problem Context
Fraud typologies evolve faster than change‑control cycles. Rules become stale, models drift, and analyst workload increases. Real data cannot leave secure zones, so collaboration slows, tools remain untested, and vendor claims are difficult to verify. Synthetic corpora enable safe iteration, repeatable evaluation, and procurement‑grade evidence.
Clinical-Claims Data Model
Here’s the blueprint, simplified:
- Entities: Patient (de‑identified), provider, facility, payer plan, claim, line item, prescription, lab event.
- Core Attributes: Dates, CPT/HCPCS/ICD codes, NPI/specialty (synthetic), amounts (billed/allowed/paid), place of service, modifiers, units, referrals, prior‑auth flags.
- Relations: Provider↔facility affiliation, patient↔provider panels, claim↔line items, episodes of care, referral chains, pharmacy fill sequences.
Dataset Schema Example
claims(
claim_id, patient_id*, provider_id*, facility_id*, date, pos,
cpt, icd10, modifiers, units, amount_billed, amount_allowed, amount_paid,
payer_plan, referral_flag, prior_auth_flag
)
lines(
line_id, claim_id, cpt, icd10, modifiers, units, npi*, specialty,
amount_billed, amount_allowed, amount_paid
)
rx(
rx_id, patient_id*, provider_id*, date, drug_class, dose, days_supply,
payment_amount, device_id*
)
labs(
lab_id, patient_id*, loinc_code, result_band, units, date
)
*Synthetic identifiers only; no PHI/PII in evaluation corpora.
Metrics Appendix
Measure what matters:
- Lift@Budget: Fraud cases detected at a fixed false-positive rate (FPR) of 0.5%, 1%, or 2%.
- Stability: Maximum per-segment delta in detection performance across specialty, region, and plan bands.
- Drift Early-Warning: Change-point scores for code usage and billing cadence.
- Analyst Yield: Cases per analyst-hour at the chosen thresholds.
- Privacy: Advantage of membership and attribute-inference probes over random guessing.
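As a rough illustration of the first metric, the sketch below computes the detection rate at a fixed FPR budget. The function name `recall_at_fpr` and the toy scores are assumptions made for this example, not part of any product API; lift would then be the comparison of this number against a baseline detector measured the same way.

```python
import math
from typing import Sequence


def recall_at_fpr(scores: Sequence[float], labels: Sequence[int], fpr_budget: float) -> float:
    """Share of fraud cases flagged while flagging at most `fpr_budget` of legitimate claims."""
    legit = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    fraud = [s for s, y in zip(scores, labels) if y == 1]
    if not legit or not fraud:
        raise ValueError("need at least one fraud and one legitimate example")
    # Number of false positives the budget allows (epsilon guards float rounding).
    allowed_fp = math.floor(fpr_budget * len(legit) + 1e-9)
    # Flag only scores strictly above the (allowed_fp + 1)-th highest legitimate score.
    threshold = legit[allowed_fp] if allowed_fp < len(legit) else float("-inf")
    return sum(s > threshold for s in fraud) / len(fraud)


# Toy data: 100 legitimate claims scored 0.00..0.99 plus four fraud cases.
scores = [i / 100 for i in range(100)] + [0.995, 0.985, 0.50, 0.20]
labels = [0] * 100 + [1] * 4
detection_at_1pct = recall_at_fpr(scores, labels, 0.01)
```

Because the threshold is strict, tied scores can only lower the realized false-positive rate, so the budget is never exceeded.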
Acceptance Checklist
Before sign-off, confirm:
- Target KPI: Defined and tied to analyst staffing.
- Operating Points: Fixed, with confidence-interval (CI) bands reported.
- Privacy Probes: Under agreed thresholds, with differential-privacy (DP) budgets honored if used.
- Drift Monitors: In place, with documented rollback rules.
- Evidence Bundle: Signed and archived.
Use Case Example
Scenario: A simulated regional upcoding sweep for a payer.
A four‑week pilot on orthopedic claims could lock operating points at 1% FPR. A synthetic‑trained baseline might boost reviewed‑case yield versus legacy rules while reducing false escalations. Stability could hold across regions; privacy probes would be monitored. Procurement proceeds only if gates pass and rollback is defined.
FAQ
Does synthetic data replace real investigations?
No. Synthetic data is for testing and development; real investigations confirm outcomes.
Can we tune prevalence to stress analysts?
Yes. Typologies are adjustable to simulate workload and policy trade‑offs.
How do you prevent overfitting to synthetic quirks?
In three ways: ablation studies confirm that robust features carry the result, detector rankings are sanity-checked on approved internal samples, and known limits are documented in the evidence bundle.
Glossary
- Operating Point: The score threshold at which the detector runs in production.
- Evidence Bundle: Signed package of metrics, configs, seeds, and hashes.
- Membership Inference: An attack that tests whether a specific record influenced training.
- DP: Differential privacy; bounds the influence any single record can have on the output.
Procurement Q&A
- Export Formats: Data as Parquet/Delta; dashboards as HTML/PDF; notebooks as HTML.
- Runtime: On-prem preferred; VPC supported; no PHI/PII leaves the enclave.
- Support: SLAs for evidence regeneration and drift-incident triage.
Contact
Ready to tackle fraud detection safely and effectively? Contact us about a focused pilot.
Generation Pipeline
Let’s build it step by step:
- Schema Design: Pick fields and ranges that matter for fraud detection; encode CPT families and ICD hierarchies.
- Distribution Learning: Learn marginals and joint structure from seeds or redacted aggregates; fit copulas or conditional models so that dependencies between fields survive.
- Sequence Synthesis: Generate episode timelines (admission→procedures→discharge; prescription refills) with realistic timing and seasonality.
- Typology Injection: Add parameterised fraud behaviors (see the Fraud Typology Library) with tunable prevalence and severity.
- Validation: Run fidelity and privacy checks; adjust and regenerate until thresholds pass.
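To make the "conditionals" option concrete, here is a minimal sketch that samples specialty, then CPT given specialty, then billed amount given CPT, from a fixed seed. Every table value and helper name (`pick`, `sample_claim`, `generate`) is a made-up placeholder standing in for redacted aggregates; a real recipe would carry far more fields, and copulas are not shown.

```python
import random

# Hypothetical aggregates standing in for redacted seed statistics.
SPECIALTY_WEIGHTS = {"orthopedics": 0.3, "cardiology": 0.2, "primary_care": 0.5}
CPT_GIVEN_SPECIALTY = {
    "orthopedics": {"99213": 0.5, "99214": 0.3, "27447": 0.2},
    "cardiology": {"99213": 0.4, "99214": 0.4, "93000": 0.2},
    "primary_care": {"99213": 0.7, "99214": 0.3},
}
# (mu, sigma) of the log billed amount per code; illustrative values only.
LOG_AMOUNT_GIVEN_CPT = {
    "99213": (4.5, 0.2),
    "99214": (4.9, 0.2),
    "93000": (3.5, 0.3),
    "27447": (9.5, 0.3),
}


def pick(rng: random.Random, weights: dict) -> str:
    """Draw one key with probability proportional to its weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def sample_claim(rng: random.Random, claim_id: int) -> dict:
    """Sample fields in dependency order so the joint structure is preserved."""
    specialty = pick(rng, SPECIALTY_WEIGHTS)
    cpt = pick(rng, CPT_GIVEN_SPECIALTY[specialty])
    mu, sigma = LOG_AMOUNT_GIVEN_CPT[cpt]
    return {
        "claim_id": f"C{claim_id:06d}",
        "specialty": specialty,
        "cpt": cpt,
        "amount_billed": round(rng.lognormvariate(mu, sigma), 2),
    }


def generate(n: int, seed: int) -> list:
    """Same seed, same corpus: the property an evidence bundle relies on."""
    rng = random.Random(seed)
    return [sample_claim(rng, i) for i in range(n)]


corpus = generate(1000, seed=7)
```

Sampling in dependency order is the simplest way to keep a code from appearing under a specialty that never bills it.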
Fidelity: What “Good Enough” Means
Here’s the standard:
- Marginals: Code, specialty, geography, and amount distributions match the reference within stated tolerances.
- Joints: Realistic provider-procedure-amount relationships; plausible co-coding; procedures consistent with patient age.
- Temporal: Weekday and seasonal effects; episode lengths; refill cadences; denial/rebill loops.
- Tail Coverage: Rare but plausible events are retained; impossible combinations are excluded by constraints.
- Utility Checks: Baseline detectors trained on synthetic data reach target lift on hold-out synthetic data and produce rankings that match those on approved samples.
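One simple way to express the marginals check is total variation distance between the reference and synthetic category frequencies. This is a sketch under that assumption; the article does not state which distance the platform uses, and `total_variation`, `marginal_gate`, and the 0.10 tolerance are illustrative.

```python
from collections import Counter
from typing import Hashable, Iterable


def total_variation(reference: Iterable[Hashable], synthetic: Iterable[Hashable]) -> float:
    """Half the L1 distance between two empirical categorical distributions (0 = identical, 1 = disjoint)."""
    ref, syn = Counter(reference), Counter(synthetic)
    n_ref, n_syn = sum(ref.values()), sum(syn.values())
    if n_ref == 0 or n_syn == 0:
        raise ValueError("both samples must be non-empty")
    categories = set(ref) | set(syn)
    # Counter returns 0 for categories missing from one side.
    return 0.5 * sum(abs(ref[c] / n_ref - syn[c] / n_syn) for c in categories)


def marginal_gate(reference, synthetic, tolerance: float) -> bool:
    """Pass when the synthetic marginal stays within the agreed tolerance."""
    return total_variation(reference, synthetic) <= tolerance


# Toy CPT marginals: 70/30 in the reference versus 65/35 in the synthetic corpus.
reference_cpt = ["99213"] * 70 + ["99214"] * 30
synthetic_cpt = ["99213"] * 65 + ["99214"] * 35
tv = total_variation(reference_cpt, synthetic_cpt)
passes = marginal_gate(reference_cpt, synthetic_cpt, tolerance=0.10)
```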
Privacy: What We Measure
We measure rather than assume:
- Process Isolation: Evaluation corpora contain no PHI/PII; seeds are minimal and controlled.
- Membership Inference: Attacks show low advantage at release settings.
- Attribute Disclosure: Prediction of sensitive attributes stays at or below baseline leakage.
- Differential Privacy (Optional): Per-policy ε, δ budgets, with impact notes, when mandated.
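"Advantage" for a membership probe is commonly reported as the best gap between true-positive and false-positive rates that an attacker achieves. The sketch below assumes the attack has already produced a score per record; `membership_advantage` and the toy scores are hypothetical, and the attack itself is out of scope here.

```python
from typing import Sequence


def membership_advantage(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Best (TPR - FPR) over all thresholds; 0.0 means the attacker does no better than guessing."""
    if not member_scores or not nonmember_scores:
        raise ValueError("need scores for both members and non-members")
    best = 0.0
    for t in sorted(set(member_scores) | set(nonmember_scores)):
        # Attacker claims "member" whenever the score reaches the threshold.
        tpr = sum(s >= t for s in member_scores) / len(member_scores)
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        best = max(best, tpr - fpr)
    return best


# Toy cases: a fully leaking release and one where the attack has no signal.
leaky = membership_advantage([0.9, 0.8], [0.1, 0.2])
safe = membership_advantage([0.5, 0.5], [0.5, 0.5])
```

A release gate would compare this number against the agreed threshold rather than require it to be exactly zero.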
Fraud Typology Library
Tune these fraud flavors:
- Upcoding: Inflated CPT levels within a specialty; tune the overbilling factor, code families, and audit risk.
- Unbundling: Procedures split into separately billed components; tune compliance pressure and recurrence.
- Phantom Billing: Claims without a delivered service; vary facility mix, distance anomalies, and timing clashes.
- Doctor Shopping: Overlapping prescriptions; control the overlap window, drug class, and device ties.
- Duplicate Billing: Repeated claims with modifiers; adjust resubmission delay and payer rules.
- Kickback Rings: Referral cycles with anomalous financials; expose graph motifs and payment flows.
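As one example of a parameterised typology, upcoding injection might look like the sketch below. The E/M code ladder, the `is_upcoded` label, and the function signature are assumptions chosen for illustration; the actual library's parameters are only named in the list above, not specified.

```python
import random

# Illustrative ladder: established-patient office visit levels, lowest to highest.
EM_LADDER = ["99212", "99213", "99214", "99215"]


def inject_upcoding(claims, provider_ids, prevalence, overbilling_factor, seed, ladder=EM_LADDER):
    """Return labelled copies of `claims`; a share of eligible claims moves one level up the ladder."""
    rng = random.Random(seed)
    out = []
    for claim in claims:
        row = dict(claim, is_upcoded=False)  # copy, so the clean corpus is untouched
        eligible = (
            row["provider_id"] in provider_ids
            and row["cpt"] in ladder[:-1]  # the top level cannot be raised further
        )
        if eligible and rng.random() < prevalence:
            row["cpt"] = ladder[ladder.index(row["cpt"]) + 1]
            row["amount_billed"] = round(row["amount_billed"] * overbilling_factor, 2)
            row["is_upcoded"] = True
        out.append(row)
    return out


clean = [
    {"claim_id": "C1", "provider_id": "P1", "cpt": "99213", "amount_billed": 100.0},
    {"claim_id": "C2", "provider_id": "P2", "cpt": "99213", "amount_billed": 100.0},
    {"claim_id": "C3", "provider_id": "P1", "cpt": "99215", "amount_billed": 200.0},
]
injected = inject_upcoding(clean, {"P1"}, prevalence=1.0, overbilling_factor=1.5, seed=1)
```

Keeping the ground-truth label on each row is what lets the evaluation gates report detection at a fixed FPR later.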
Feature Families
Build the signals:
- Code Semantics: Code-family distance, incompatible code pairs, specialty fit.
- Temporal: Visit cadence, inter-arrival z-scores, day-of-week effects.
- Financial: Amount residuals versus peers, payer-mix anomalies, denial/rebill patterns.
- Graph: Provider-patient motifs, referral cycles, shared devices.
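The "amount residuals versus peers" feature can be sketched as a z-score within a peer group. Here the peer group is assumed to be claims sharing specialty and CPT code; that grouping, the field names, and `peer_amount_zscores` are illustrative choices, not the platform's definition.

```python
from collections import defaultdict
from statistics import mean, pstdev


def peer_amount_zscores(claims):
    """Map claim_id to the z-score of its billed amount within its (specialty, cpt) peer group."""
    groups = defaultdict(list)
    for c in claims:
        groups[(c["specialty"], c["cpt"])].append(c["amount_billed"])
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    zscores = {}
    for c in claims:
        mu, sd = stats[(c["specialty"], c["cpt"])]
        # A group with no spread (including singletons) carries no peer signal.
        zscores[c["claim_id"]] = 0.0 if sd == 0 else (c["amount_billed"] - mu) / sd
    return zscores


claims = [
    {"claim_id": "C1", "specialty": "orthopedics", "cpt": "99213", "amount_billed": 100.0},
    {"claim_id": "C2", "specialty": "orthopedics", "cpt": "99213", "amount_billed": 100.0},
    {"claim_id": "C3", "specialty": "orthopedics", "cpt": "99213", "amount_billed": 100.0},
    {"claim_id": "C4", "specialty": "orthopedics", "cpt": "99213", "amount_billed": 200.0},
    {"claim_id": "C5", "specialty": "cardiology", "cpt": "93000", "amount_billed": 40.0},
]
z = peer_amount_zscores(claims)
```

In practice a median-based variant is often preferred, since the outliers being hunted also inflate the mean and standard deviation.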
Modeling and Thresholds
Keep it clear and strong:
Pair transparent baselines (rules, tree ensembles) for interpretability with deep models where they add lift. Pick operating points that match investigator capacity (alerts per day per team) and state the trade-off explicitly. Pilots often target “+X% cases at fixed FPR” rather than raw AUC.
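Matching an operating point to investigator capacity can be as simple as taking the k-th highest score from a representative day. The sketch below assumes that approach; `threshold_for_capacity` and the sample scores are hypothetical.

```python
from typing import Sequence


def threshold_for_capacity(daily_scores: Sequence[float], alerts_per_day: int) -> float:
    """Score threshold such that alerting on `score >= threshold` yields about `alerts_per_day` alerts."""
    if alerts_per_day <= 0:
        raise ValueError("alerts_per_day must be positive")
    if not daily_scores:
        raise ValueError("need at least one score")
    ranked = sorted(daily_scores, reverse=True)
    # If capacity exceeds volume, every claim can be reviewed.
    k = min(alerts_per_day, len(ranked))
    return ranked[k - 1]


# Toy day of five scored claims and a team that can review two alerts.
day = [0.1, 0.9, 0.5, 0.7, 0.3]
threshold = threshold_for_capacity(day, alerts_per_day=2)
alerts = [s for s in day if s >= threshold]
```

Tied scores at the threshold can push the alert count slightly above capacity, which is one reason to report the trade-off alongside the chosen point.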
Evaluation Gates
Procurement-ready checks:
- Operating Point Utility: Detection at fixed FPR budgets, reported with CIs.
- Segment Stability: Deltas across specialty, region, and plan type stay within bounds.
- Drift Sensitivity: Early-warning KPIs respond under simulated shifts.
- Analyst Cost Curves: Incremental cases per analyst-hour.
- Privacy Gates: Probes under thresholds; DP budgets honored if used.
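The segment stability gate can be sketched as a maximum difference in detection rate between segments at the locked threshold. The record layout `(segment, score, label)`, the function names, and the bounds are assumptions for this example.

```python
from collections import defaultdict


def segment_detection_rates(records, threshold):
    """Detection rate per segment, computed over fraud-labelled records only."""
    hits, totals = defaultdict(int), defaultdict(int)
    for segment, score, label in records:
        if label == 1:
            totals[segment] += 1
            hits[segment] += int(score >= threshold)
    return {segment: hits[segment] / totals[segment] for segment in totals}


def max_segment_delta(rates):
    """Gap between the best-served and worst-served segment."""
    return max(rates.values()) - min(rates.values())


def stability_gate(records, threshold, max_delta):
    """Return (passed, per-segment rates) so the evidence bundle can show both."""
    rates = segment_detection_rates(records, threshold)
    return max_segment_delta(rates) <= max_delta, rates


records = [
    ("north", 0.90, 1), ("north", 0.80, 1), ("north", 0.20, 1), ("north", 0.95, 1),
    ("south", 0.90, 1), ("south", 0.10, 1),
    ("south", 0.99, 0),  # legitimate claims are ignored by this metric
]
passed, rates = stability_gate(records, threshold=0.5, max_delta=0.3)
```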
Evidence Bundle
Shipped with every release:
- Hashes: Schema, recipe, and environment hashes; SBOM for artifacts.
- Fidelity Metrics: CIs and visual comparisons for marginals and joints.
- Privacy Results: Probe results with interpretation; DP parameters if used.
- Ablation Table: Impact of each feature and recipe choice.
- Notes: Intended use, limits, drift monitors, rollback rules, change-control.
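A minimal sketch of a hashed and signed manifest follows. It assumes SHA-256 over canonical JSON for the hashes and uses an HMAC as a stand-in for whatever signature scheme a real deployment would use; the manifest layout and function names are hypothetical.

```python
import hashlib
import hmac
import json


def sha256_of(obj) -> str:
    """Hash a JSON-serialisable object in canonical form (sorted keys, no extra whitespace)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def _sign(body: dict, signing_key: bytes) -> str:
    # HMAC is an assumption here; an asymmetric signature would be the usual production choice.
    return hmac.new(signing_key, sha256_of(body).encode("ascii"), hashlib.sha256).hexdigest()


def build_manifest(schema, recipe, environment, metrics, signing_key: bytes) -> dict:
    """Collect component hashes and metrics, then sign the whole body."""
    body = {
        "hashes": {
            "schema": sha256_of(schema),
            "recipe": sha256_of(recipe),
            "environment": sha256_of(environment),
        },
        "metrics": metrics,
    }
    return dict(body, signature=_sign(body, signing_key))


def verify_manifest(manifest: dict, signing_key: bytes) -> bool:
    """Recompute the signature over everything except the signature field."""
    body = {k: v for k, v in manifest.items() if k != "signature"}
    return hmac.compare_digest(_sign(body, signing_key), manifest.get("signature", ""))


key = b"demo-key-not-for-production"
manifest = build_manifest(
    schema={"claims": ["claim_id", "cpt"]},
    recipe={"typology": "upcoding", "seed": 7},
    environment={"python": "3.x"},
    metrics={"detection_at_1pct_fpr": 0.5},
    signing_key=key,
)
```

Canonical serialisation matters: without sorted keys, two identical configs could hash differently and break reproducibility checks.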
Pilot → Policy in Four Weeks
Your roadmap:
- Week 1: Pick a typology and KPI; lock the schema; generate v1 corpora.
- Week 2: Train baselines; tune operating points; check fidelity and privacy.
- Week 3: Red-team failure modes; set gates; package evidence.
- Week 4: Run acceptance with stakeholders; sign off on thresholds and rollback.
Integration
Plug it in:
- Deployment: On-prem or private cloud; edge bundles for air-gapped review.
- Data Interchange: Parquet/Delta; Databricks-ready jobs.
- Governance: Signed evidence bundles tracked in CI/CD and document control.
AethergenPlatform prioritizes evidence. You get realism without identifiers, utility at your alert budget, and reproducible packages suitable for audit.
Worked Example: Upcoding Review
Let’s break it down:
- Setup: Define CPT family and specialty constraints; set cost bands.
- Generate: Create a cohort with tunable upcoding prevalence and overbilling factor.
- Train: Baseline trees plus rules; calibrate to 1% FPR.
- Report: Cases per analyst-hour, with explanations and feature contributions.
Worked Example: Doctor Shopping
Another go:
- Emit: Overlapping prescription sequences across providers and pharmacies.
- Construct: Device/address correlations and cadence features.
- Compare: Rules versus learned models; adjust escalation policies.
- Publish: Stability results across regions and plan types.
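A rule-style baseline for this typology could flag pairs of fills whose supply windows overlap across different providers. The sketch below reuses field names from the `rx` schema; the overlap rule, the `min_overlap_days` parameter, and the sample fills are assumptions for illustration.

```python
from datetime import date, timedelta
from itertools import combinations


def overlapping_fills(fills, min_overlap_days=1):
    """Pairs of same-patient, same-drug-class fills from different providers with overlapping supply."""
    flagged = []
    for a, b in combinations(sorted(fills, key=lambda f: f["date"]), 2):
        if a["patient_id"] != b["patient_id"] or a["drug_class"] != b["drug_class"]:
            continue
        if a["provider_id"] == b["provider_id"]:
            continue  # refills from one prescriber are not the pattern of interest
        a_end = a["date"] + timedelta(days=a["days_supply"])
        b_end = b["date"] + timedelta(days=b["days_supply"])
        # Supply windows are [date, date + days_supply); negative means no overlap.
        overlap = (min(a_end, b_end) - max(a["date"], b["date"])).days
        if overlap >= min_overlap_days:
            flagged.append((a["rx_id"], b["rx_id"], overlap))
    return flagged


fills = [
    {"rx_id": "R1", "patient_id": "S1", "provider_id": "P1", "drug_class": "opioid",
     "date": date(2025, 1, 1), "days_supply": 30},
    {"rx_id": "R2", "patient_id": "S1", "provider_id": "P2", "drug_class": "opioid",
     "date": date(2025, 1, 10), "days_supply": 30},
    {"rx_id": "R3", "patient_id": "S1", "provider_id": "P1", "drug_class": "opioid",
     "date": date(2025, 3, 1), "days_supply": 30},
    {"rx_id": "R4", "patient_id": "S2", "provider_id": "P3", "drug_class": "opioid",
     "date": date(2025, 1, 5), "days_supply": 30},
]
pairs = overlapping_fills(fills)
```

The pairwise loop is quadratic in the number of fills, so a production version would group by patient and drug class first.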
Analyst Experience
Make it easy:
- Reproducible Notebooks: Tied to evidence bundle versions.
- Scenario Sliders: Adjust prevalence and severity without coding.
- One-Click Reruns: Fixed seeds, so retraining is repeatable.
Limits and Non-Goals
Know the boundaries:
- No Perfect Match: Synthetic data does not mimic micro-populations perfectly; we measure fidelity and disclose limits.
- No PHI/PII: Evaluation corpora stay identifier-free.
- No Fragile Heuristics: Features are validated via ablation.
Next Steps
Let’s get moving:
- Pick Typology: Choose one typology and a KPI; plan a two-week pilot.
- Align Gates: Match acceptance gates to investigator capacity.
- Review Evidence: Check the draft bundle before promotion.