July 15, 2024

Health Benefit

Healthy is Rich, Today's Best Investment

4 Emerging Strategies to Advance Big Data Analytics in Healthcare

While the potential for big data analytics in healthcare has been a hot topic in recent years, the possible risks of using these tools have received just as much attention.

Big data analytics technologies have demonstrated their promise in enhancing multiple areas of care, from medical imaging and chronic disease management to population health and precision medicine. These algorithms could increase the efficiency of care delivery, reduce administrative burdens, and accelerate disease diagnosis.

But despite all the good these tools could achieve, the harm these algorithms could cause is nearly as significant.

Concerns about data access and collection, implicit and explicit bias, and issues with patient and provider trust in analytics technologies have hindered the use of these tools in everyday healthcare delivery.

Healthcare researchers and provider organizations are working to solve these issues, facilitating the use of big data analytics in clinical care for better quality and outcomes.

In this primer, HealthITAnalytics will explore how improving data quality, addressing bias, prioritizing data privacy, and building providers’ trust in analytics tools can advance the four types of big data analytics in healthcare.


In healthcare, it’s widely understood that the success of big data analytics tools depends on the value of the information used to train them. Algorithms trained on inaccurate, poor-quality data can yield erroneous results, leading to inadequate care delivery.

However, obtaining quality training data is complex and time-intensive, leaving many organizations without the resources to build effective models.

Researchers across the industry are working to overcome this challenge.

Data Availability

In 2019, a team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) developed an automated system to gather more data from images to train machine learning models, synthesizing a massive dataset of distinct training examples.

This approach is beneficial for use cases in which high-quality images are available, but there are too few to develop a robust dataset. The synthesized dataset can be used to improve the training of machine learning models, enabling them to detect anatomical structures in new scans.

This image segmentation approach helps address one of the major data quality issues: insufficient data points.
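To make the idea of synthesizing extra training examples concrete, here is a minimal sketch in Python. It is not the CSAIL system, whose details are not covered here. As a stand-in, it applies a few fixed transforms (a flip and a one-pixel shift) to a tiny image and to its segmentation mask, so each new image keeps a correctly aligned label. The function names and the toy 3x3 "scan" are hypothetical.

```python
def flip_horizontal(grid):
    """Mirror each row of a 2D grid."""
    return [row[::-1] for row in grid]

def shift_right(grid, fill=0):
    """Shift every row one pixel to the right, padding on the left."""
    return [[fill] + row[:-1] for row in grid]

def augment(image, mask):
    """Return (image, mask) pairs; the same transform is applied to both
    so the segmentation labels stay aligned with the pixels."""
    transforms = [
        lambda g: g,                                # original
        flip_horizontal,                            # mirrored
        shift_right,                                # shifted
        lambda g: shift_right(flip_horizontal(g)),  # mirrored, then shifted
    ]
    return [(t(image), t(mask)) for t in transforms]

# Toy "scan" and its labeled mask (1 marks the structure of interest).
image = [[0, 9, 0],
         [0, 8, 7],
         [0, 0, 0]]
mask = [[0, 1, 0],
        [0, 1, 1],
        [0, 0, 0]]

pairs = augment(image, mask)
print(len(pairs), "labeled examples from 1 original")
```

Fixed transforms like these are a much simpler technique than a learned synthesis pipeline, but they show why one labeled scan can yield several usable training pairs.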

Data Quality

But what about cases with a wealth of relevant data but varying quality or data synthesis challenges?

In these cases, it’s useful to begin by defining and exploring some common healthcare analytics concepts.

Data quality, as the name suggests, is a way to measure the reliability and accuracy of the data. Addressing quality is critical to healthcare data generation, collection, and processing.

If the data collection process yielded a sufficient number of data points but there is a question of quality, stakeholders can look at the data’s structure and identify whether converting the structure of the datasets into a common format is appropriate. This is known as data standardization, and it can help ensure that the data are consistent, which is necessary for effective analysis.

Data cleaning — flagging and addressing data abnormalities — and data normalization, the process of organizing data, can take standardization even further.
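As an illustration of standardization and cleaning, the following Python sketch maps records from two hypothetical source systems into one common format and flags an implausible value for review. The field names, date formats, units, and plausibility limits are all assumptions made for the example and do not come from USCDI or any other standard.

```python
from datetime import datetime

# Hypothetical exports from two systems with different conventions.
SOURCE_A = [{"patient_id": "a1", "dob": "1980-03-15", "weight_kg": 72.5}]
SOURCE_B = [
    {"PatientID": "B2", "BirthDate": "03/15/1975", "WeightLb": 180.0},
    {"PatientID": "B3", "BirthDate": "07/01/1990", "WeightLb": -5.0},
]

LB_PER_KG = 2.20462

def standardize_a(rec):
    """Map a source A record to the common format."""
    birth = datetime.strptime(rec["dob"], "%Y-%m-%d").date()
    return {"patient_id": rec["patient_id"].upper(),
            "birth_date": birth.isoformat(),
            "weight_kg": round(rec["weight_kg"], 1)}

def standardize_b(rec):
    """Map a source B record to the common format (pounds to kilograms)."""
    birth = datetime.strptime(rec["BirthDate"], "%m/%d/%Y").date()
    return {"patient_id": rec["PatientID"].upper(),
            "birth_date": birth.isoformat(),
            "weight_kg": round(rec["WeightLb"] / LB_PER_KG, 1)}

def clean(records, min_kg=0.5, max_kg=400.0):
    """Split records into plausible ones and ones flagged for review.
    The limits are illustrative, not clinical guidance."""
    ok, flagged = [], []
    for rec in records:
        target = ok if min_kg <= rec["weight_kg"] <= max_kg else flagged
        target.append(rec)
    return ok, flagged

standardized = ([standardize_a(r) for r in SOURCE_A]
                + [standardize_b(r) for r in SOURCE_B])
valid, flagged = clean(standardized)
print(len(valid), "valid;", len(flagged), "flagged for review")
```

Flagged records are set aside rather than deleted, so that someone can decide whether the value is a data-entry error or a genuine outlier.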

Tools like the United States Core Data for Interoperability (USCDI) and USCDI+ can help in cases where a healthcare organization doesn’t have enough high-quality data.

Data Analysis

In scenarios with a large amount of data, synthesizing the data for analysis creates another potential hurdle.

As seen throughout the COVID-19 pandemic, when data related to the virus became available globally, healthcare leaders faced the challenge of creating high-quality datasets to help researchers answer vital questions about the virus.

In 2020, the White House Office of Science and Technology Policy issued a call to action for experts to synthesize an artificial intelligence (AI) algorithm-friendly COVID-19 dataset to bolster these efforts.

The dataset represents an extensive machine-readable coronavirus literature collection – including over 29,000 articles at the time of creation – designed to help researchers sift through and analyze the data more quickly.

By promoting collaboration among researchers, healthcare institutions, and other stakeholders, initiatives like this can support the efficient synthesis of large-scale, high-quality datasets.


As healthcare organizations become increasingly reliant on analytics algorithms to help them make care decisions, bias is a major hurdle to the safe and effective deployment of these tools.

Tackling algorithmic bias requires stakeholders to be aware of how biases are introduced and reproduced at every stage of algorithm development and deployment. In many algorithms, bias can be baked in almost immediately if the developers rely on biased data.

Causes of Data Bias

The US Department of Health and Human Services (HHS) Office of Minority Health (OMH) indicates that lack of diversity in an algorithm’s training data is a significant source of bias. Further, bias can be coded into algorithms based on developers’ beliefs or assumptions, including implicit and explicit biases.

If, for example, a developer incorrectly assumes that symptoms of a particular condition are more common or severe in one population than another, the resulting algorithm could be biased and perpetuate health disparities.

Some have suggested that bringing awareness to potential biases can remedy the issue of algorithmic bias, but research suggests that a more robust approach is required. One study published in the Future Healthcare Journal in 2021 demonstrated that while bias training can help individuals recognize biases in themselves and others, it is not an effective debiasing strategy.

The OMH recommends best practices beyond bias training, encouraging developers to work with diverse stakeholders to ensure that algorithms are adequately developed, validated, and reviewed to maximize utility and minimize harm.

In scenarios where diverse training data for algorithms is unavailable, techniques like synthetic data can help minimize potential biases.

Strategies for Minimizing Data Bias

In terms of algorithm deployment and monitoring, the OMH suggests that the tools should be implemented gradually and that users should have a way to provide feedback to the developers for future algorithm improvement.

To this end, developers can work with experts and end-users to understand what clinical measures are important to providers, according to researchers from the University of Massachusetts Amherst.

In recent years, healthcare stakeholders have increasingly developed frameworks and best practices to minimize bias in clinical algorithms.

A panel of experts convened by the Agency for Healthcare Research and Quality (AHRQ) and the National Institute on Minority Health and Health Disparities (NIMHD) published a special communications article in the December 2023 issue of JAMA Network Open outlining five principles to address the impact of algorithm bias on racial and ethnic disparities in healthcare.

The framework guides healthcare stakeholders to mitigate and prevent bias at each stage of an algorithm’s life cycle by promoting health equity, ensuring algorithm transparency, earning trust by engaging patients and communities, explicitly identifying fairness issues, and establishing accountability for equity and fairness in outcomes from algorithms.
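One common way to make "explicitly identifying fairness issues" concrete is to compare a model's error rates across patient groups. The Python sketch below computes the true positive rate (sensitivity) per group and the gap between groups. It is one illustrative check, not a method prescribed by the panel, and the group labels and records are made up.

```python
from collections import defaultdict

def true_positive_rate_by_group(records):
    """records: iterable of (group, actual, predicted) with 0/1 labels.
    Returns the share of actual positives the model caught, per group."""
    positives = defaultdict(int)
    caught = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 1:
            positives[group] += 1
            caught[group] += predicted
    return {g: caught[g] / positives[g] for g in positives}

# Hypothetical audit sample: (group, actual condition, model prediction).
records = [
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 1, 0), ("group_a", 0, 0),
    ("group_b", 1, 1), ("group_b", 1, 0), ("group_b", 1, 0), ("group_b", 0, 1),
]

rates = true_positive_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())
print(rates, "gap:", round(gap, 3))
```

A large gap means the model misses the condition more often in one group than another, which is the kind of signal a review team would want to investigate before and after deployment.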

When trained using high-quality data and deployed in settings that will be monitored and adjusted to minimize biases, algorithms can help address disparities in maternal health, preterm births, and social determinants of health (SDOH).


In algorithm development, data privacy and security are high on the list of concerns. Legal, privacy, and cultural obstacles can keep researchers from accessing the large, diverse data sets needed to train analytics technologies.

Over the years, experts have worked to craft approaches that can balance the need for data access against the need to protect patient privacy.

Machine Learning

In 2020, a team from the University of Iowa (UI) set out to develop a solution to this problem. With a $1 million grant from the National Science Foundation (NSF), UI researchers created a machine learning platform to train algorithms with data from around the world.

The tool is a decentralized, asynchronous solution called ImagiQ, and it relies on an ecosystem of machine learning models so that institutions can select models that work best for their populations. Using the platform, organizations can upload and share the models, but not patient data, with each other.

The researchers indicated that traditional machine learning methods require a centralized database where patient data can be directly accessed for use in model training, but these approaches are often limited by practical issues like information security, patient privacy, data ownership, and the burden on health systems tasked with creating and maintaining those centralized databases.

ImagiQ helps overcome some of these challenges, but it is not the only framework to do so.

Federated Learning

Researchers from the University of Pittsburgh Swanson School of Engineering were awarded $1.7 million from the National Institutes of Health (NIH) in 2022 to advance their efforts to develop a federated learning (FL)-based approach to achieve fairness in AI-assisted medical screening tools.

FL is a privacy-protection method that enables researchers to train AI models across multiple decentralized devices or servers holding local data samples without exchanging them.

The approach is useful for improving model performance without compromising data privacy, as AI trained on one institution’s data typically does not generalize well on data from another.
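The core mechanic of FL can be sketched in a few lines of Python. In this simplified example, which uses federated averaging as an assumed stand-in for whatever scheme a given project adopts, each site improves a shared model on its own data and sends back only the updated weight. A coordinator then averages the weights in proportion to each site's sample count. The one-parameter model and the two sites' data are hypothetical.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps for the model y = w * x
    using only this site's private (x, y) pairs."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(site_weights, site_sizes):
    """Combine site models, weighting each by its number of samples."""
    total = sum(site_sizes)
    return sum(w * n for w, n in zip(site_weights, site_sizes)) / total

# Hypothetical sites; the raw pairs never leave their site.
sites = [
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
    [(4.0, 8.0), (5.0, 10.0)],
]

global_w = 0.0
for _ in range(50):
    # Each site starts from the current global model and trains locally.
    local_ws = [local_update(global_w, data) for data in sites]
    # Only the resulting weights are shared with the coordinator.
    global_w = federated_average(local_ws, [len(d) for d in sites])

print(round(global_w, 4))
```

Real deployments add safeguards this sketch omits, such as secure aggregation, because shared weights can still leak some information about the underlying data.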

However, FL is not a perfect solution, as experts from the University of Southern California (USC) Viterbi School of Engineering pointed out at the 2023 International Workshop on Health Intelligence. They posited that FL raises multiple concerns, such as how well a model generalizes (that is, makes predictions based on what it has learned from its training data), along with the hurdles presented by missing data and the data harmonization process.

The research team presented a framework for addressing these challenges, but there are other tools healthcare stakeholders can use to prioritize data privacy, such as confidential computing or blockchain. These tools center on making the data largely inaccessible and resistant to tampering by unauthorized parties.

Privacy-Enhancing Technologies

Alternatives that do not require significant investments in cloud computing or blockchain are also available to stakeholders through privacy-enhancing technologies (PETs), three categories of which are particularly suited to healthcare use cases.

Algorithmic PETs — like encryption, differential privacy, and zero-knowledge proofs — protect data privacy by altering how the information is represented while ensuring it is usable. Often, this involves modifying the changeability or traceability of healthcare data.
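Differential privacy is a useful example of how an algorithmic PET changes the representation of data while keeping it usable. The Python sketch below applies the Laplace mechanism to a simple count query: noise scaled to 1/epsilon is added to the true count, so the published number reveals little about any single patient. The ages, the query, and the epsilon value are illustrative assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of values matching the predicate.
    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # seeded only so the example is repeatable
ages = [34, 71, 66, 45, 80, 59, 67, 23]  # hypothetical patient ages

noisy = private_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng)
print("noisy count of patients 65 or older:", round(noisy, 2))
```

A smaller epsilon adds more noise and gives stronger privacy, while a larger epsilon gives more accurate answers, so choosing it is a policy decision as much as a technical one.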

In contrast, architectural PETs focus on the structure of data or computation environments, rather than how those data are represented, to enable users to exchange information without exchanging any underlying data. Federated learning, secure multi-party computation, and blockchain fall into this PET category.
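Secure multi-party computation can be illustrated with additive secret sharing, one of its simplest building blocks. In the Python sketch below, each hypothetical hospital splits its private case count into random-looking shares, so no single share reveals the count, yet the shares still add up to the correct total. The modulus, party count, and values are assumptions for the example, and a production protocol would also need secure channels and protection against dishonest parties.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this value

def make_shares(value, n_parties):
    """Split value into n_parties shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values):
    """Simulate the protocol: party i sends share j of its value to party j,
    each party adds up what it received, and only those partial sums
    are combined into the final total."""
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS
        for j in range(n)
    ]
    return sum(partial_sums) % MODULUS

counts = [120, 45, 310]  # hypothetical per-hospital case counts
total = secure_sum(counts)
print("combined count:", total)
```

The total is recovered exactly, but no party ever sees another party's raw count, only shares that look like random numbers.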

Augmentation PETs, as the name suggests, augment existing data sources or create fully synthetic ones. This approach can help enhance the availability and utility of data used in healthcare analytics projects. Digital twins and generative adversarial networks are commonly used for this purpose.

But even the most robust data privacy infrastructure cannot compensate for a lack of trust in big data analytics tools.


Just as patients need to trust that analytics algorithms can keep their data safe, providers must trust that these tools can deliver information in a functional, reliable way.

The issue of trustworthy analytics tools has recently taken center stage in conversations around how Americans interact with AI — knowingly and unknowingly — in their daily lives. Healthcare is one of the industries where advanced technologies present the most significant potential for harm, leading the federal government to begin taking steps to guide the deployment and use of algorithms.

In October 2023, President Joe Biden signed the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, which outlines safety, security, privacy, equity, and other standards for how industry and government should approach AI innovation.

The order’s directives are broad, as they are designed to apply to all US industries, but it does lay out some industry-specific directives for those looking at how it will impact healthcare. Primarily, the executive order provides a framework for creating standards, laws, and regulations around AI and establishes a roadmap of subsequent actions that government agencies, like HHS, must take to build such a framework.

However, this process will take months, and more robust regulation of healthcare algorithms could take even longer, leading industry stakeholders to develop their own best practices for using analytics technologies in healthcare.

One such effort is the National Academy of Medicine (NAM) Artificial Intelligence Code of Conduct (AICC), which represents a collaborative effort among healthcare, research, and patient advocacy groups to create a national architecture for responsible AI use in healthcare.

In a 2024 interview with HealthITAnalytics, NAM leadership emphasized that this governance infrastructure is necessary to gain trust and improve healthcare as advanced technologies become more ubiquitous in care settings.

However, governance structure must be paired with education and clinician support to obtain buy-in from providers.

Some of this can start early, as evidenced by recent work from the University of Texas (UT) health system to incorporate AI training into medical school curricula. Having staff members dedicated to spearheading analytics initiatives, such as a chief analytics officer, is another approach that healthcare organizations can use to make providers feel more comfortable with these tools.

These staff can also work to bolster trust at the enterprise level by focusing on creating a healthcare data culture, gaining provider buy-in from the top down, and having strategies to address concerns about clinician overreliance on analytics technologies.

With healthcare organizations increasingly leveraging big data analytics tools for enhanced insights and streamlined care processes, overcoming data quality, bias, privacy, and security issues and fostering user trust will be critical for successfully using these models in clinical care.

As research evolves around AI, machine learning, and other analytics algorithms, the industry will keep refining these tools for improved patient care.

