Harnessing the power of synthetic data in healthcare: innovation, application, and privacy
Assefa, S. Generating Synthetic Data in Finance: Opportunities, Challenges and Pitfalls. Available at SSRN: (2020).
Gonzales, A., Guruswamy, G. & Smith, S. R. Synthetic data in health care: A narrative review. PLOS Digital Health 2, e0000082 (2023).
Google Scholar
McDuff, D., Curran T. & Kadambi, A. Synthetic Data in Healthcare. arXiv preprint arXiv:2304.03243 (2023).
Gotz, D. & Borland, D. Data-driven healthcare: challenges and opportunities for interactive visualization. IEEE computer Graph. Appl. 36, 90–96 (2016).
Google Scholar
Jordon J. et al. Weller Adrian. Synthetic Data – what, why and how? arXiv: 2205.03257 [cs], (2022).
Philpott, D. A Guide to Federal Terms and Acronyms: Bernan Press; (2017)
Metropolis, N. & Ulam, S. The Monte Carlo method. J. Am. Stat. Assoc. 44, 335–341 (1949).
Google Scholar
Goodfellow, Ian et al. Generative adversarial networks. Commun. ACM 63, 139–144 (2020).
Google Scholar
Diederik, P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, (2013).
Eric Bonabeau Agent-based modeling: Methods and techniques for simulating human systems. Proc. Natl Acad. Sci. 99, 7280–7287 (2002).
Google Scholar
Carmona, R. and Delarue, F. Probabilistic Theory of Mean Field Games with Applications, volume 84. Springer (2018).
Walonoski, J., et al. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J Am Med Inform Assoc. Epub 2017/10/13. PMID: 29025144 (2017).
MDClone Launches New Phase of Collaboration with Washington University in St. Louis. [cited 31 October 2019]. In: MDClone News [Internet]. Available from: (2019).
Reiter, J. Inference for partially synthetic, public use microdata sets. Surv. Methodol. 29, 181–188 (2003).
Loong, B., Zaslavsky, A. M., He, Y. & Harrington, D. P. Disclosure control using partially synthetic data for large-scale health surveys, with applications to CanCORS. Stat. Med 32, 4139–4161 (2013).
Google Scholar
Raghunathan, T., Reiter, J. & Rubin, D. Multiple imputation for statistical disclosure limitation. J. Stat. 19, 1–16 (2003).
Reiner Benaim, A. et al. Analyzing medical research results based on synthetic data and their relation to real data results: systematic comparison from five observational studies. JMIR Med Inf. 8, e16492 (2020).
Google Scholar
Ngufor, C., Van Houten, H., Caffo, B. S., Shah, N. D. & McCoy, R. G. Mixed effect machine learning: a framework for predicting longitudinal change in hemoglobin A1c. J. Biomed. Inf. 89, 56–67 (2019).
Google Scholar
Enanoria, W. T. et al. The effect of contact investigations and public health interventions in the control and prevention of measles transmission: a simulation study. PLoS ONE 11, e0167160 (2016).
Google Scholar
Laderas, T. et al. Teaching data science fundamentals through realistic synthetic clinical cardiovascular data. bioRxiv. 232611. (2017).
Harron, K., Gilbert, R., Cromwell, D. & Van Der Meulen, J. Linking data for mothers and babies in de-identified electronic health data. PLoS One. 11. (2016).
Ringel, J. S., Eibner, C., Girosi, F., Cordova, A. & McGlynn, E. A. Modeling health care policy alternatives. Health Serv. Res 45, 1541–1558 (2010).
Google Scholar
Aljaaf, A. J. et al. Partially synthesised dataset to improve prediction accuracy. In: Huang D. S., Bevilacqua V., Premanratne P., editors. Intelligent Computing Theories and Application. Switzerland: Springer Cham. p. 855–866 (2016).
Amoon, A. T., Arah, O. A. & Kheifets, L. The sensitivity of reported effects of EMF on childhood leukemia to uncontrolled confounding by residential mobility: a hybrid simulation study and an empirical analysis using CAPS data. Cancer Causes Control 30, 901–908 (2019).
Google Scholar
Symonds, P. et al. MicroEnv: a microsimulation model for quantifying the impacts of environmental policies on population health and health inequalities. Sci. Total Environ. 697, 134105 (2019).
Google Scholar
Hennessy, D. Creating a synthetic database for use in microsimulation models to investigate alternative health care financing strategies in Canada. Int J. Microsimul 8, 41–74 (2015).
Sun, Z., Wang, F., Hu, J. LINKAGE: An approach for comprehensive risk prediction for care management. In: Cao L., Zhang C., editors. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney, Australia. New York: Association for Computing Machinery; 2015. 1145–1154 (2015).
Davis, P., Lay-Yee, R. & Pearson, J. Using micro-simulation to create a synthesised data set and test policy options: the case of health service effects under demographic ageing. Health Policy 97, 267–274 (2010).
Google Scholar
Ive, J. et al. Generation and evaluation of artificial mental health records for natural language processing. NPJ digital Med. 3, 1–9 (2020).
Google Scholar
Jiang, Y., Chen, H., Loew, M., Ko, H. COVID-19 CT Image Synthesis with a Conditional Generative Adversarial Network. arXiv: arXiv:2007.14638 (2020)
Das, H. P. et al Conditional Synthetic Data Generation for Robust Machine Learning Applications with Limited Pandemic Data arXiv:2109.0648609.06486arXiv:2109.06486Top of FormBottom of Form
Cheng, W., Lian, W. & Tian, J. Building the hospital intelligent twins for all-scenario intelligence health care. DIGITAL HEALTH 8. (2022)
Karakra, A., Fontanili, F., Lamine, E. & Lamothe, J. “HospiT’Win: A Predictive Simulation-Based Digital Twin for Patients Pathways in Hospital,” 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Chicago, IL, USA, pp. 1-4, (2019).
Cockrell, C., Schobel-McHugh, S., Lisboa, F., Vodovotz, Y., An, G. Generating synthetic data with a mechanism-based Critical Illness Digital Twin: Demonstration for Post Traumatic Acute Respiratory Distress Syndrome. bioRxiv 2022.11.22.517524.
Filippo, M. D. et al. Single-Cell Digital Twins for Cancer Preclinical Investigation. Methods Mol. Biol. (Clifton NJ) 2088, 331–343 (2020).
Google Scholar
Zhang, J., Qian, H. & Zhou, H. Application and Research of Digital Twin Technology in Safety and Health Monitoring of the Elderly in Community. Zhongguo Yi Liao Qi Xie Za Zhi Chin. J. Med Instrum. 43, 410–413 (2019).
Hose, D. R. et al. Cardiovascular Models for Personalised Medicine: Where Now and Where Next? Med Eng. Phys. 72, 38–48 (2019).
Google Scholar
Pencina, M. J., Goldstein, B. A. & D’Agostino, R. B. N. Engl. J. Med. 382, 1583 (2020).
Google Scholar
Norori, N., Hu, Q., Aellen, F. M., Faraci, F. D. & Tzovara, A. Addressing bias in big data and AI for health care: A call for open science. Patterns (N. Y) 2(Oct), 100347 (2021).
Google Scholar
Naeem, M. F., Oh, S. J., Uh, Y., Choi, Y. & Yoo, J. In International Conference on Machine Learning, 7176–7185 (PMLR, 2020).
Sajjadi, M. S., Bachem, O., Lucic, M., Bousquet, O. & Gelly, S. In Advances in Neural Information Processing Systems (2018).
Alaa, A. M., van Breugel, B., Saveliev, E. & van der Schaar, M. In International Conference on Machine Learning (2021).
Möller, F. et al. Out-of-distribution Detection and Generation using Soft Brownian Offset Sampling and Autoencoders. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, pp. 46-55. (2021).
Chen, G. et al. Learning Open Set Network with Discriminative Reciprocal Points. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J. M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12348. Springer, Cham. (2020).
Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. [Independently Published] (2022).
Lenatti, M., Paglialonga, A., Orani, V., Ferretti, M. & Mongelli, M.“Characterization of Synthetic Health Data Using Rule-Based Artificial Intelligence Models,” in IEEE Journal of Biomedical and Health Informatics. https://doi.org/10.1109/JBHI.2023.3236722.
Ghaffar Nia, N., Kaplanoglu, E. & Nasab, A. Evaluation of artificial intelligence techniques in disease diagnosis and prediction. Discov. Artif. Intell. 3, 5 (2023).
Google Scholar
Celino, I. Who is this Explanation for? Human Intelligence and Knowledge Graphs for eXplainable AI. arXiv: 2005.13275 (2020).
Hatherley, J., Sparrow, R., Howard, M. (2022). The Virtues of Interpretable Medical Artificial Intelligence. Camb Q Healthc Ethics:1-10. https://doi.org/10.1017/S0963180122000305.
Courtois, M., Filiot, A., & Ficheur, G. Distribution-Based Similarity Measures Applied to Laboratory Results Matching. In Applying the FAIR Principles to Accelerate Health Research in Europe in the Post COVID-19 Era (pp. 94-98). IOS Press (2021).
Xia, Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Prog. Mol. Biol. Transl. Sci. 171, 309–491 (2020).
Google Scholar
Reddy, G. T. et al. Analysis of dimensionality reduction techniques on big data. Ieee Access 8, 54776–54788 (2020).
Google Scholar
Alur, R. et al. Auditing for Human Expertise. arXiv: 2306.01646 (2023).
Vivian Lai, S et al. Human-AI Collaboration via Conditional Delegation: A Case Study of Content Moderation. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ‘22). Association for Computing Machinery, New York, NY, USA, Article 54, 1–18. (2022).
Tewari, A. mHealth Systems Need a Privacy-by-Design Approach: Commentary on “Federated Machine Learning, Privacy-Enhancing Technologies, and Data Protection Laws in Medical Research: Scoping Review”. J. Med. Internet Res. 25, e46700 (2023).
Google Scholar
Arora, A. & Arora, A. Synthetic patient data in health care: a widening legal loophole. Lancet 399(Apr), 1601–1602 (2022).
Google Scholar
Appenzeller, A., Leitner, M., Philipp, P., Krempel, E. & Beyerer, J. Privacy and Utility of Private Synthetic Data for Medical Data Analyses. Appl. Sci. 12, 12320 (2022).
Google Scholar
Mendelevitch, O., & Lesh, M. D. Fidelity and privacy of synthetic medical data. arXiv preprint arXiv:2101.08658.(2021).
Sweeney, L. K-ANONYMITY: A MODEL FOR PROTECTING PRIVACY. Int. J. Uncertain., Fuzziness Knowl.-Based Syst. 10(Oct.), 557–570 (2002).
Google Scholar
Henriksen-Bulmer, J. & Jeary, S. Re-Identification Attacks—A Systematic Literature Review. Int. J. Inf. Manag. 36(Dec.), 1184–1192 (2016).
Google Scholar
Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. K. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5(Jun), 493–497 (2021).
Google Scholar
US Food and Drug Administration. (n.d.). Artificial intelligence and machine learning in software as a medical device. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device.
Brauneck, A. et al. Federated machine learning, privacy-enhancing technologies, and data protection laws in medical research: scoping review. J. Med Internet Res 25, e41588 (2023).
Google Scholar
Dwork, C. Differential Privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds) Automata, Languages and Programming. ICALP 2006. Lecture Notes in Computer Science, vol 4052. Springer, Berlin, Heidelberg. (2006).
Varma, G., Chauhan, R. & Singh, D. Sarve: synthetic data and local differential privacy for private frequency estimation. Cybersecurity 5, 26 (2022).
Google Scholar
Bao, E., Xiao, X., Zhao, J., Zhang, D., & Ding, B. Synthetic data generation with differential privacy via Bayesian networks. Journal of Privacy and Confidentiality 11. (2021).
Rosenblatt, L. et al. Differentially Private Synthetic Data: Applied Evaluations and Enhancements. arXiv:2011.05537
Dwork, C., Kohli, N. & Mulligan, D. Differential privacy in practice: expose your epsilons. JPC. 9 (2019).
Ficek, J., Wang, W., Chen, H., Dagne, G. & Daley, E. Differential privacy in health research: a scoping review. J. Am. Med Inf. Assoc. 28, 2269–2276 (2021).
Google Scholar
Jordon, J., Yoon, J., & Van Der Schaar, M. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International conference on learning representations. (2019).
Xie, L., Lin, K., Wang, S., Wang, F., Zhou, J. Differentially Private Generative Adversarial Network. arXiv:1802.06739.
Patel, J. & Bhatt, N. Review of digital image forgery detection. Int. J. Recent Innov. Trends Comput. Commun. 5, 152–155 (2017).
Sadiku, M., Shadare, A. & Musa, S. Digital chain of custody. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 7, 117–118 (2017).
Hamid, A. & Naaz, R. Forensic-chain: Blockchain based digital forensics chain of custody with POC in hyperledger composer. Int. J. Digit. Investig. 28, 44–55 (2019).
Google Scholar
Wang, S., Yang, M., Ge, T., Luo, Y. and Fu. X. BBS: A Blockchain Big-Data Sharing System. ICC 2022 – IEEE International Conference on Communications, Seoul, Korea, Republic of, pp. 4205-4210, (2022).
link