Weighing the pros and cons of synthetic healthcare data use
Healthcare generates a wealth of data, much of which is captured in provider notes and patient EHRs. In recent decades, healthcare organizations have begun not only to see the value of this information but also to use it to pursue key goals, such as improved outcomes.
However, data availability and quality, alongside concerns about patient privacy and HIPAA compliance, have hindered these efforts. To overcome these challenges, some in the healthcare industry are advocating for the use of synthetic data, information that mimics real-world data (RWD) without compromising privacy.
This primer will explore what synthetic data is and outline the opportunities and limitations of its use in healthcare analytics.
WHAT IS SYNTHETIC DATA?
Synthetic data can be readily understood through a comparison with its counterpart, RWD. RWD is characterized by the US Food and Drug Administration as “data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources. Examples of RWD include data derived from electronic health records, medical claims data, data from product or disease registries and data gathered from other sources (such as digital health technologies) that can inform on health status.”
Real-world evidence (RWE), by contrast, is defined as “the clinical evidence about the usage and potential benefits or risks of a medical product derived from analysis of RWD.” In other words, RWD is the raw material, and RWE is the evidence derived from analyzing it.
As these definitions suggest, RWD plays a crucial role in informing providers and payers about the health status of individuals and populations. That information is invaluable in analytics efforts to advance medical science and bolster healthcare quality.
However, the value of healthcare data is inextricably linked to the fact that it contains extremely personal, private information that must be protected in line with HIPAA regulations. HIPAA compliance can present significant data access and use hurdles for medical researchers, creating a need for a viable alternative to RWD.
One of the proposed alternatives is synthetic data, which, as the name suggests, is artificially generated. Both synthetic data and RWD are key to the data generation step of the healthcare data lifecycle, but synthetic data can ease some of the burdens associated with data collection and preparation, such as inconsistent data quality.
RWD varies in quality and may require significant effort to harmonize and standardize before it can be used for an analytics process. Further, RWD must be appropriately de-identified to protect patient privacy.
The defining characteristics of synthetic data make it a promising option for sectors like healthcare, but those interested in using it must be mindful of both its benefits and its pitfalls.
BENEFITS OF SYNTHETIC DATA USE
Synthetic data avoids some of the challenges associated with RWD by mimicking it.
A 2024 article published in the Journal of AHIMA notes, “Synthetic data is non-reversible, artificially created data that replicates the statistical characteristics and correlations of real-world, raw data. Utilizing both discrete and non-discrete variables of interest, synthetic data does not contain identifiable information because it uses a statistical approach to create a completely new data set.”
By creating a data set free of personally identifiable information that maintains the same statistical properties as the RWD, synthetic data allows stakeholders to sidestep many potential privacy issues while streamlining much of the data preparation process.
“Synthetic data offers the potential to mimic the characteristics of a real data set, without sensitive patient information, making it a good option for analyzing large but sensitive samples of real individual-level patient data. Synthetic data differs from de-identified data in that it is built from scratch, as opposed to being based on individual patient records, which means synthetic data cannot be de-anonymized. Unlike de-identified data, synthetic data puts a protective layer around the original data to preserve both the privacy of the original and the underlying value of that data,” the article continues.
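The idea of replicating "statistical characteristics and correlations" can be illustrated with a minimal Python sketch. It assumes a deliberately simple generator, a bivariate normal distribution fitted to two numeric fields, which is not the method described in the Journal of AHIMA article. The records in REAL_ROWS and the function names fit_summary and sample_synthetic are made up for illustration, and the sketch makes no claim about privacy guarantees.

```python
import math
import random
import statistics

# Made-up (age, systolic blood pressure) pairs standing in for real records.
REAL_ROWS = [(34, 118), (45, 124), (52, 131), (61, 138), (29, 112),
             (70, 145), (48, 127), (56, 135), (38, 120), (66, 141)]


def fit_summary(rows):
    """Estimate means, standard deviations, and correlation of (x, y) pairs."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_synthetic(summary, n, seed=0):
    """Draw n new rows from a bivariate normal with the fitted summary."""
    mean_x, mean_y, sd_x, sd_y, rho = summary
    rng = random.Random(seed)
    # Share of the second variable's spread not explained by the first.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mean_x + sd_x * z1,
                     mean_y + sd_y * (rho * z1 + residual * z2)))
    return rows


real_summary = fit_summary(REAL_ROWS)
synthetic_rows = sample_synthetic(real_summary, n=5000)
synthetic_summary = fit_summary(synthetic_rows)
print("real:     ", [round(v, 2) for v in real_summary])
print("synthetic:", [round(v, 2) for v in synthetic_summary])
```

The synthetic rows are drawn from the fitted distribution rather than copied from any input record, yet their means and correlation should land close to those of the original rows. Production generators model many more variables and far richer structure than this toy example.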
Patient privacy preservation is one of the biggest benefits of using synthetic data, but these data sets are also useful for preventing data re-identification and supporting algorithm training.
Researchers writing in npj Digital Medicine in March 2023 outlined how a patient-centric synthetic data generation approach used a local model to create “avatar data” based on individual biomedical information. The approach outperformed two other synthetic generation techniques in terms of privacy protection while retaining statistical similarity to the real-world data set.
This method helps protect the original data and transform the information, thereby minimizing the risk of a privacy breach and preventing potential data re-identification.
A separate study appearing in last year’s October issue of npj Digital Medicine highlighted that the use of synthetic data could help explore policy implications and bolster the development of AI.
“Synthetic data has the potential to estimate the benefit of screening and healthcare policies, treatments, or clinical interventions, augment machine learning algorithms (e.g., image classification pipelines), pre-train machine learning models that can then be fine-tuned for specific patient populations, and improve public health models to predict outbreaks of infectious diseases,” the authors wrote.
SYNTHETIC DATA’S DRAWBACKS
Despite these benefits, synthetic data also presents significant potential challenges related to data quality, bias, and AI model collapse.
Leadership from John Snow Labs, writing in Forbes last year, asserted that “while synthetic data may be useful for demonstrating healthcare software user interfaces, it is currently not suitable for analytics, data science or training medical machine learning models” due to limitations stemming from data leakage, patient cohort generation, and bias.
Data leakage refers to when data from the test set is used during model training, which may artificially inflate a model’s performance.
Synthetic data typically lacks the “noise” present in RWD, and models trained on synthetic data may achieve much higher performance than other tools, potentially due to data leakage. In these models, the generated synthetic patient data is too similar to the test set data, which can lead to overly optimistic assessments of performance.
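One rough way to screen for this kind of leakage is to measure how many test records have a near-identical twin in the synthetic training data. The Python sketch below is one possible check under stated assumptions: records are numeric tuples on comparable scales, and the function names and the threshold value are hypothetical choices, not an established standard.

```python
import math


def nearest_distance(record, pool):
    """Euclidean distance from one record to its closest match in pool."""
    return min(math.dist(record, other) for other in pool)


def leakage_rate(synthetic, test, threshold):
    """Fraction of test records lying within threshold of a synthetic record."""
    hits = sum(1 for rec in test
               if nearest_distance(rec, synthetic) <= threshold)
    return hits / len(test)


# Made-up (age, systolic blood pressure) records for illustration only.
synthetic_train = [(50.0, 130.0), (60.0, 140.0), (41.0, 122.0)]
held_out_test = [(50.0, 130.0), (30.0, 110.0)]

rate = leakage_rate(synthetic_train, held_out_test, threshold=1.0)
print(f"share of test records with a near-duplicate: {rate:.2f}")
```

A high rate would suggest that performance measured on the test set may be overly optimistic. In practice, features would need to be scaled before distances are compared, and the threshold would need to be justified for the data at hand.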
The use of synthetic data can also present challenges for generating patient cohorts. Currently, models often do well when tasked with generating a potential discharge summary for a single patient based on a given set of characteristics. However, asking the model to generate hundreds or thousands of summaries representative of an entire population is significantly more difficult.
Like other types of AI, generative algorithms are also prone to biases. Developers often posit that these tools are only as good as the data they are trained on, meaning that biases in the data will translate to biases in the model.
This holds true for both synthetic data and RWD, leaving those wishing to use synthetic data in their analyses faced with a conundrum: synthetic data generated by a model trained on biased RWD is likely to contain the same biases, which could perpetuate healthcare disparities.
Addressing this requires synthetic data generators to be trained on representative RWD, but creating and accessing high-quality, representative data presents a hurdle for many researchers and healthcare organizations.
Using synthetic data to train AI models can further contribute to model collapse, a phenomenon in which models trained on synthetically generated content begin to degrade over time, leading to dips in performance.
As with bias, addressing model collapse requires that models be trained on human-produced data.
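A toy simulation can make the mechanism behind model collapse more concrete. The Python sketch below assumes the simplest possible "model", a normal distribution refitted each generation to a small sample drawn from the previous generation's fit. It is an illustration of the general phenomenon, not a reproduction of any cited study, and the function name and parameter values are hypothetical.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Refit a normal distribution to its own samples, generation after generation.

    Returns the fitted standard deviation at each generation, starting
    with the original value of 1.0.
    """
    rng = random.Random(seed)
    mean, sd = 0.0, 1.0  # generation 0: the "human-produced" distribution
    history = [sd]
    for _ in range(generations):
        # Train only on output from the previous generation's model.
        samples = [rng.gauss(mean, sd) for _ in range(sample_size)]
        mean, sd = statistics.fmean(samples), statistics.stdev(samples)
        history.append(sd)
    return history


history = simulate_collapse()
print(f"initial spread: {history[0]:.4f}")
print(f"final spread:   {history[-1]:.4g}")
```

Because each generation sees only a small sample of the previous one, estimation error compounds and the fitted spread tends to shrink toward zero, so the rare values in the tails are lost first. This is a tendency over many generations rather than a guarantee for any single step.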
These issues are all aspects of a more significant concern: the quality and validation of synthetic data generators.
In a May 2023 Computer Science Review article, the authors noted that “the literature shows the effectiveness of synthetic data sets for different [healthcare] applications in research, academics and testing according to existing statistical and task-based utility metrics. However, the focus on longitudinal synthetic data seems deficient. Moreover, a unified metric for generic quality assessment of synthetic data is lacking.”
Further, researchers writing in BMC Medical Research Methodology in 2020 indicated that a wide variety of models for synthetic data generation exist, but different models often rely on different metrics to evaluate the quality of the data produced, making comparisons difficult.
Effectively evaluating models is critical to ensuring that their output and performance meet the healthcare industry’s high standards for care and patient safety.
However, these challenges are not insurmountable.
A 2022 study published in JMIR Medical Informatics demonstrated that a generative model utility metric known as the multivariate Hellinger distance can successfully compare and rank synthetic data generation methods for logistic regression prediction models.
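For intuition, the basic Hellinger distance between two discrete distributions can be computed in a few lines of Python. This sketch covers only the single-variable form, which ranges from 0 (identical distributions) to 1 (no overlap). The multivariate version evaluated in the JMIR study is more involved and is not reproduced here, and the category labels below are made up.

```python
import math
from collections import Counter


def hellinger(real_values, synthetic_values):
    """Hellinger distance between the category frequencies of two samples.

    H(P, Q) = sqrt( 0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2 )
    """
    p, q = Counter(real_values), Counter(synthetic_values)
    n_p, n_q = len(real_values), len(synthetic_values)
    categories = set(p) | set(q)
    total = sum((math.sqrt(p[c] / n_p) - math.sqrt(q[c] / n_q)) ** 2
                for c in categories)
    return math.sqrt(total / 2)


# Made-up diagnosis labels for illustration only.
real = ["diabetes", "asthma", "asthma", "hypertension"]
close_copy = ["asthma", "diabetes", "hypertension", "asthma"]
poor_copy = ["asthma", "asthma", "asthma", "asthma"]

print(hellinger(real, close_copy))
print(hellinger(real, poor_copy))
```

A generator whose output yields a smaller distance from the real data preserves that variable's distribution more faithfully, which is the intuition behind using distance-based metrics to rank generation methods.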
Advances like these have enabled researchers and health systems to begin utilizing synthetic data for various projects, including efforts to accelerate COVID-19 research and tackle neighborhood-based health disparities.