
Benchmark Study

Why Raw Synthetic Data Fails Clinical AI

Commodity synthetic generators score 65–75% on Train-on-Synthetic, Test-on-Real (TSTR) benchmarks. We quantify the gap across clinical domains and demonstrate how hybrid correction closes it.

Published April 13, 2026

Synthetic data is widely promoted as a solution to healthcare AI's chronic data access problem. Vendors claim that algorithmically generated patient records can replace real clinical data for model training, offering privacy protection and unlimited scale. This paper examines that claim against published evidence and finds it wanting.

Commodity synthetic data generators — including rule-based engines like Synthea and uncalibrated generative adversarial networks — consistently score between 65% and 75% on Train-on-Synthetic, Test-on-Real (TSTR) benchmarks. Models trained on their output underperform by 25–35% compared to models trained on real patient data. A comprehensive benchmarking study across MIMIC-III and MIMIC-IV found AUC drops of 0.063 to 0.268 depending on the generator and task.
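The TSTR protocol described above can be sketched in a few lines. This is a minimal, illustrative toy, not the paper's actual MIMIC-based benchmark: the feature generator and the coefficient vectors (`W_TRUE`, `W_SYNTH`) are invented assumptions that mimic a generator which gets some label-feature relationships wrong.

```python
# Hedged sketch of a Train-on-Synthetic, Test-on-Real (TSTR) benchmark.
# Toy data only; W_SYNTH deliberately misstates two of the true
# label-feature relationships to mimic a low-fidelity generator.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

W_TRUE = np.array([1.5, -1.0, 0.8, 0.0, 0.0])   # real label mechanism
W_SYNTH = np.array([1.5, 0.0, 0.0, 0.8, 0.0])   # generator's distorted version

def make_cohort(n, weights):
    """Toy 'patient' features with labels from a linear rule plus noise."""
    X = rng.normal(size=(n, 5))
    y = (X @ weights + 0.1 * rng.normal(size=n) > 0).astype(int)
    return X, y

X_real_train, y_real_train = make_cohort(2000, W_TRUE)
X_real_test, y_real_test = make_cohort(1000, W_TRUE)
X_synth, y_synth = make_cohort(2000, W_SYNTH)

def auc_on_real(X_train, y_train):
    """Train on the given cohort, always evaluate on held-out REAL data."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_real_test, model.predict_proba(X_real_test)[:, 1])

trtr = auc_on_real(X_real_train, y_real_train)  # train-real baseline
tstr = auc_on_real(X_synth, y_synth)            # train-synthetic
print(f"TRTR AUC={trtr:.3f}  TSTR AUC={tstr:.3f}  ratio={tstr / trtr:.0%}")
```

The ratio of TSTR to TRTR performance is the fidelity score vendors quote; the AUC gap between the two models is the cost of training on low-fidelity synthetic data.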

We identify five systemic failure modes: distribution errors that misrepresent disease prevalence across cardiology, endocrinology, oncology, and pediatrics; correlation breakdown that flattens the clinical relationships between diagnoses, labs, and medications; temporal artifacts that impose artificial regularity on care timing; missing clinical context that strips away physician reasoning and narrative documentation; and privacy theater that offers the appearance of confidentiality without formal guarantees.
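One of these failure modes, correlation breakdown, admits a simple diagnostic: compare the pairwise correlation matrices of the real and synthetic cohorts. The sketch below uses invented diabetes-style variables (glucose, HbA1c, insulin dose) and simulates a generator that preserves each marginal distribution while destroying the joint structure by permuting columns independently; it is an assumption-laden illustration, not a vendor's actual pipeline.

```python
# Hedged sketch: detecting "correlation breakdown" in synthetic data.
# Variable names and the permutation-based "generator" are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Real-style cohort: glucose, HbA1c, and insulin dose are strongly linked.
glucose = rng.normal(140, 30, n)
hba1c = 4.0 + 0.03 * glucose + rng.normal(0, 0.3, n)
insulin = 0.2 * glucose + rng.normal(0, 10, n)
real = np.column_stack([glucose, hba1c, insulin])

# Low-fidelity synthetic cohort: identical marginals, but each column is
# shuffled independently, so clinical relationships are flattened.
synth = np.column_stack([rng.permutation(col) for col in real.T])

corr_real = np.corrcoef(real, rowvar=False)
corr_synth = np.corrcoef(synth, rowvar=False)
max_gap = np.max(np.abs(corr_real - corr_synth))
print(f"max pairwise correlation gap: {max_gap:.2f}")
```

A generator can pass every univariate distribution check and still fail this test badly, which is why marginal-only validation reports are insufficient.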

Stadler, Oprisanu, and Troncoso demonstrated in a landmark 2022 USENIX Security paper that synthetic data provides a "false sense of privacy" — their evaluation showed that synthetic data either does not prevent inference attacks or does not retain data utility, and often fails at both.

The 65–75% TSTR ceiling is the difference between a toy and a tool. Closing the gap to 94%+ requires addressing each failure mode systematically through statistical correction, clinical enrichment, temporal calibration, privacy engineering, and rigorous multi-layer validation. This paper presents published evidence for each failure mode and the five questions every hospital data team should ask synthetic data vendors.

Full paper available

Download the complete white paper with methodology details, references, and supplementary data.


Questions about our methodology?

We welcome collaboration with health systems, academic researchers, and AI teams.
