An open-source framework for end-to-end analysis of electronic health record data

All datasets that were used during the development of ehrapy and the use cases were used according to their terms of use as indicated by each provider.

Table of Contents

Design and implementation of ehrapy

A unified pipeline as provided by our ehrapy framework streamlines the analysis of EHR data by providing an efficient, standardized approach, which reduces the complexity and variability in data pre-processing and analysis. This consistency ensures reproducibility of results and facilitates collaboration and sharing within the research community. Additionally, the modular structure allows for easy extension and customization, enabling researchers to adapt the pipeline to their specific needs while building on a solid foundational framework.

ehrapy was designed from the ground up as an open-source effort with community support. The package, as well as all associated tutorials and dataset preparation scripts, are open source. Development takes place publicly on GitHub where the developers discuss feature requests and issues directly with users. This tight interaction between both groups ensures that we implement the most pressing needs to cater the most important use cases and can guide users when difficulties arise. The open-source nature, extensive documentation and modular structure of ehrapy are designed for other developers to build upon and extend ehrapy’s functionality where necessary. This allows us to focus ehrapy on the most important features to keep the number of dependencies to a minimum.

ehrapy was implemented in the Python programming language and builds upon numerous existing numerical and scientific open-source libraries, specifically matplotlib¹⁰¹, seaborn¹⁰², NumPy¹⁰³, numba¹⁰⁴, Scipy¹⁰⁵, scikit-learn⁵³ and Pandas¹⁰⁶. Although taking considerable advantage of all packages implemented, ehrapy also shares the limitations of these libraries, such as a lack of GPU support or small performance losses due to the translation layer cost for operations between the Python interpreter and the lower-level C language for matrix operations. However, by building on very widely used open-source software, we ensure seamless integration and compatibility with a broad range of tools and platforms to promote community contributions. Additionally, by doing so, we enhance security by allowing a larger pool of developers to identify and address vulnerabilities¹⁰⁷. All functions are grouped into task-specific modules whose implementation is complemented with additional dependencies.

Data preparation

Dataloaders

ehrapy is compatible with any type of vectorized data, where vectorized refers to the data being stored in structured tables in either on-disk or database form. The input and output module of ehrapy provides readers for common formats, such as OMOP, CSV tables or SQL databases through Pandas. When reading in such datasets, the data are stored in the appropriate slots in a new AnnData⁴⁶ object. ehrapy’s data module provides access to more than 20 public EHR datasets that feature diseases, including, but not limited to, Parkinson’s disease, breast cancer, chronic kidney disease and more. All dataloaders return AnnData objects to allow for immediate analysis.

AnnData for EHR data

Our framework required a versatile data structure capable of handling various matrix formats, including Numpy¹⁰³ for general use cases and interoperability, Scipy¹⁰⁵ sparse matrices for efficient storage, Dask¹⁰⁸ matrices for larger-than-memory analysis and Awkward array¹⁰⁹ for irregular time-series data. We needed a single data structure that not only stores data but also includes comprehensive annotations for thorough contextual analysis. It was essential for this structure to be widely used and supported, which ensures robustness and continual updates. Interoperability with other analytical packages was a key criterion to facilitate seamless integration within existing tools and workflows. Finally, the data structure had to support both in-memory operations and on-disk storage using formats such as HDF5 (ref. ¹¹⁰) and Zarr¹¹¹, ensuring efficient handling and accessibility of large datasets and the ability to easily share them with collaborators.

All of these requirements are fulfilled by the AnnData format, which is a popular data structure in single-cell genomics. At its core, an AnnData object encapsulates diverse components, providing a holistic representation of data and metadata that are always aligned in dimensions and easily accessible. A data matrix (commonly referred to as ‘X’) stands as the foundational element, embodying the measured data. This matrix can be dense (as Numpy array), sparse (as Scipy sparse matrix) or ragged (as Awkward array) where dimensions do not align within the data matrix. The AnnData object can feature several such data matrices stored in ‘layers’. Examples of such layers can be unnormalized or unencoded data. These data matrices are complemented by an observations (commonly referred to as ‘obs’) segment where annotations on the level of patients or visits are stored. Patients’ age or sex, for instance, are often used as such annotations. The variables (commonly referred to as ‘var’) section complements the observations, offering supplementary details about the features in the dataset, such as missing data rates. The observation-specific matrices (commonly referred to as ‘obsm’) section extends the capabilities of the AnnData structure by allowing the incorporation of observation-specific matrices. These matrices can represent various types of information at the individual cell level, such as principal component analysis (PCA) results, t-distributed stochastic neighbor embedding (t-SNE) coordinates or other dimensionality reduction outputs. Analogously, AnnData features a variables-specific variables (commonly referred to as ‘varm’) component. The observation-specific pairwise relationships (commonly referred to as ‘obsp’) segment complements the ‘obsm’ section by accommodating observation-specific pairwise relationships. This can include connectivity matrices, indicating relationships between patients. The inclusion of an unstructured annotations (commonly referred to as ‘uns’) component further enhances flexibility. This segment accommodates unstructured annotations or arbitrary data that might not conform to the structured observations or variables categories. Any AnnData object can be stored on disk in h5ad or Zarr format to facilitate data exchange.

ehrapy natively interfaces with the scientific Python ecosystem via Pandas¹¹² and Numpy¹⁰³. The development of deep learning models for EHR data¹¹³ is further accelerated through compatibility with pathml¹¹⁴, a unified framework for whole-slide image analysis in pathology, and scvi-tools¹¹⁵, which provides data loaders for loading tensors from AnnData objects into PyTorch¹¹⁶ or Jax arrays¹¹⁷ to facilitate the development of generalizing foundational models for medical artificial intelligence¹¹⁸.

Feature annotation

After AnnData creation, any metadata can be mapped against ontologies using Bionty ( Bionty provides access to the Human Phenotype, Phecodes, Phenotype and Trait, Drug, Mondo and Human Disease ontologies.

Key medical terms stored in an AnnData object in free text can be extracted using the Medical Concept Annotation Toolkit (MedCAT)¹¹⁹.

Data processing

Cohort tracking

ehrapy provides a CohortTracker tool that traces all filtering steps applied to an associated AnnData object. To calculate cohort summary statistics, the implementation makes use of tableone¹²⁰ and can subsequently be plotted as bar charts together with flow diagrams¹²¹ that visualize the order and reasoning of filtering operations.

Basic pre-processing and quality control

ehrapy encompasses a suite of functionalities for fundamental data processing that are adopted from scanpy⁵² but adapted to EHR data:

1.

Regress out: To address unwanted sources of variation, a regression procedure is integrated, enhancing the dataset’s robustness.
2.

Subsample: Selects a specified fraction of observations.
3.

Balanced sample: Balances groups in the dataset by random oversampling or undersampling.
4.

Highly variable features: The identification and annotation of highly variable features following the ‘highly variable genes’ function of scanpy is seamlessly incorporated, providing users with insights into pivotal elements influencing the dataset.

To identify and minimize quality issues, ehrapy provides several quality control functions:

1.

Basic quality control: Determines the relative and absolute number of missing values per feature and per patient.
2.

Winsorization: For data refinement, ehrapy implements a winsorization process, creating a version of the input array less susceptible to extreme values.
3.

Feature clipping: Imposes limits on features to enhance dataset reliability.
4.

Detect biases: Computes pairwise correlations between features, standardized mean differences for numeric features between groups of sensitive features, categorical feature value count differences between groups of sensitive features and feature importances when predicting a target variable.
5.

Little’s MCAR test: Applies Little’s MCAR test whose null hypothesis is that data are MCAR. Rejecting the null hypothesis may not always mean that data are not MCAR, nor is accepting the null hypothesis a guarantee that data are MCAR. For more details, see Schouten et al.¹²².
6.

Summarize features: Calculates statistical indicators per feature, including minimum, maximum and average values. This can be especially useful to reduce complex data with multiple measurements per feature per patient into sets of columns with single values.

Imputation is crucial in data analysis to address missing values, ensuring the completeness of datasets that can be required for specific algorithms. The ‘ehrapy’ pre-processing module offers a range of imputation techniques:

1.

Explicit Impute: Replaces missing values, in either all columns or a user-specified subset, with a designated replacement value.
2.

Simple Impute: Imputes missing values in numerical data using mean, median or the most frequent value, contributing to a more complete dataset.
3.

KNN Impute: Uses k-nearest neighbor imputation to fill in missing values in the input AnnData object, preserving local data patterns.
4.

MissForest Impute: Implements the MissForest strategy for imputing missing data, providing a robust approach for handling complex datasets.
5.

MICE Impute: Applies the MICE algorithm for imputing data. This implementation is based on the miceforest ( package.

Data encoding can be required if categoricals are a part of the dataset to obtain numerical values only. Most algorithms in ehrapy are compatible only with numerical values. ehrapy offers two encoding algorithms based on scikit-learn⁵³:

1.

One-Hot Encoding: Transforms categorical variables into binary vectors, creating a binary feature for each category and capturing the presence or absence of each category in a concise representation.
2.

Label Encoding: Assigns a unique numerical label to each category, facilitating the representation of categorical data as ordinal values and supporting algorithms that require numerical input.

To ensure that the distributions of the heterogeneous data are aligned, ehrapy offers several normalization procedures:

1.

Log Normalization: Applies the natural logarithm function to the data, useful for handling skewed distributions and reducing the impact of outliers.
2.

Max-Abs Normalization: Scales each feature by its maximum absolute value, ensuring that the maximum absolute value for each feature is 1.
3.

Min-Max Normalization: Transforms the data to a specific range (commonly (0, 1)) by scaling each feature based on its minimum and maximum values.
4.

Power Transformation Normalization: Applies a power transformation to make the data more Gaussian like, often useful for stabilizing variance and improving the performance of models sensitive to distributional assumptions.
5.

Quantile Normalization: Aligns the distributions of multiple variables, ensuring that their quantiles match, which can be beneficial for comparing datasets or removing batch effects.
6.

Robust Scaling Normalization: Scales data using the interquartile range, making it robust to outliers and suitable for datasets with extreme values.
7.

Scaling Normalization: Standardizes data by subtracting the mean and dividing by the standard deviation, creating a distribution with a mean of 0 and a standard deviation of 1.
8.

Offset to Positive Values: Shifts all values by a constant offset to make all values non-negative, with the lowest negative value becoming 0.

Dataset shifts can be corrected using the scanpy implementation of the ComBat¹²³ algorithm, which employs a parametric and non-parametric empirical Bayes framework for adjusting data for batch effects that is robust to outliers.

Finally, a neighbors graph can be efficiently computed using scanpy’s implementation.

Embeddings

To obtain meaningful lower-dimensional embeddings that can subsequently be visualized and reused for downstream algorithms, ehrapy provides the following algorithms based on scanpy’s implementation:

1.

t-SNE: Uses a probabilistic approach to embed high-dimensional data into a lower-dimensional space, emphasizing the preservation of local similarities and revealing clusters in the data.
2.

UMAP: Embeds data points by modeling their local neighborhood relationships, offering an efficient and scalable technique that captures both global and local structures in high-dimensional data.
3.

Force-Directed Graph Drawing: Uses a physical simulation to position nodes in a graph, with edges representing pairwise relationships, creating a visually meaningful representation that emphasizes connectedness and clustering in the data.
4.

Diffusion Maps: Applies spectral methods to capture the intrinsic geometry of high-dimensional data by modeling diffusion processes, providing a way to uncover underlying structures and patterns.
5.

Density Calculation in Embedding: Quantifies the density of observations within an embedding, considering conditions or groups, offering insights into the concentration of data points in different regions and aiding in the identification of densely populated areas.

Clustering

ehrapy further provides algorithms for clustering and trajectory inference based on scanpy:

1.

Leiden Clustering: Uses the Leiden algorithm to cluster observations into groups, revealing distinct communities within the dataset with an emphasis on intra-cluster cohesion.
2.

Hierarchical Clustering Dendrogram: Constructs a dendrogram through hierarchical clustering based on specified group by categories, illustrating the hierarchical relationships among observations and facilitating the exploration of structured patterns.

Feature ranking

ehrapy provides two ways of ranking feature contributions to clusters and target variables:

1.

Statistical tests: To compare any obtained clusters to obtain marker features that are significantly different between the groups, ehrapy extends scanpy’s ‘rank genes groups’. The original implementation, which features a t-test for numerical data, is complemented by a g-test for categorical data.
2.

Feature importance: Calculates feature rankings for a target variable using linear regression, support vector machine or random forest models from scikit-learn. ehrapy evaluates the relative importance of each predictor by fitting the model and extracting model-specific metrics, such as coefficients or feature importances.

Dataset integration

Based on scanpy’s ‘ingest’ function, ehrapy facilitates the integration of labels and embeddings from a well-annotated reference dataset into a new dataset, enabling the mapping of cluster annotations and spatial relationships for consistent comparative analysis. This process ensures harmonized clinical interpretations across datasets, especially useful when dealing with multiple experimental diseases or batches.

Knowledge inference

Survival analysis

ehrapy’s implementation of survival analysis algorithms is based on lifelines¹²⁴:

1.

Ordinary Least Squares (OLS) Model: Creates a linear regression model using OLS from a specified formula and an AnnData object, allowing for the analysis of relationships between variables and observations.
2.

Generalized Linear Model (GLM): Constructs a GLM from a given formula, distribution and AnnData, providing a versatile framework for modeling relationships with nonlinear data structures.
3.

Kaplan–Meier: Fits the Kaplan–Meier curve to generate survival curves, offering a visual representation of the probability of survival over time in a dataset.
4.

Cox Hazard Model: Constructs a Cox proportional hazards model using a specified formula and an AnnData object, enabling the analysis of survival data by modeling the hazard rates and their relationship to predictor variables.
5.

Log-Rank Test: Calculates the P value for the log-rank test, comparing the survival functions of two groups, providing statistical significance for differences in survival distributions.
6.

GLM Comparison: Given two fit GLMs, where the larger encompasses the parameter space of the smaller, this function returns the P value, indicating the significance of the larger model and adding explanatory power beyond the smaller model.

Trajectory inference

Trajectory inference is a computational approach that reconstructs and models the developmental paths and transitions within heterogeneous clinical data, providing insights into the temporal progression underlying complex systems. ehrapy offers several inbuilt algorithms for trajectory inference based on scanpy:

1.

Diffusion Pseudotime: Infers the progression of observations by measuring geodesic distance along the graph, providing a pseudotime metric that represents the developmental trajectory within the dataset.
2.

Partition-based Graph Abstraction (PAGA): Maps out the coarse-grained connectivity structures of complex manifolds using a partition-based approach, offering a comprehensive visualization of relationships in high-dimensional data and aiding in the identification of macroscopic connectivity patterns.

Because ehrapy is compatible with scverse, further trajectory inference-based algorithms, such as CellRank, can be seamlessly applied.

Causal inference

ehrapy’s causal inference module is based on ‘dowhy’⁶⁹. It is based on four key steps that are all implemented in ehrapy:

1.

Graphical Model Specification: Define a causal graphical model representing relationships between variables and potential causal effects.
2.

Causal Effect Identification: Automatically identify whether a causal effect can be inferred from the given data, addressing confounding and selection bias.
3.

Causal Effect Estimation: Employ automated tools to estimate causal effects, using methods such as matching, instrumental variables or regression.
4.

Sensitivity Analysis and Testing: Perform sensitivity analysis to assess the robustness of causal inferences and conduct statistical testing to determine the significance of the estimated causal effects.

Patient stratification

ehrapy’s complete pipeline from pre-processing to the generation of lower-dimensional embeddings, clustering, statistical comparison between determined groups and more facilitates the stratification of patients.

Visualization

ehrapy features an extensive visualization pipeline that is customizable and yet offers reasonable defaults. Almost every analysis function is matched with at least one visualization function that often shares the name but is available through the plotting module. For example, after importing ehrapy as ‘ep’, ‘ep.tl.umap(adata)’ runs the UMAP algorithm on an AnnData object, and ‘ep.pl.umap(adata)’ would then plot a scatter plot of the UMAP embedding.

ehrapy further offers a suite of more generally usable and modifiable plots:

1.

Scatter Plot: Visualizes data points along observation or variable axes, offering insights into the distribution and relationships between individual data points.
2.

Heatmap: Represents feature values in a grid, providing a comprehensive overview of the data’s structure and patterns.
3.

Dot Plot: Displays count values of specified variables as dots, offering a clear depiction of the distribution of counts for each variable.
4.

Filled Line Plot: Illustrates trends in data with filled lines, emphasizing variations in values over a specified axis.
5.

Violin Plot: Presents the distribution of data through mirrored density plots, offering a concise view of the data’s spread.
6.

Stacked Violin Plot: Combines multiple violin plots, stacked to allow for visual comparison of distributions across categories.
7.

Group Mean Heatmap: Creates a heatmap displaying the mean count per group for each specified variable, providing insights into group-wise trends.
8.

Hierarchically Clustered Heatmap: Uses hierarchical clustering to arrange data in a heatmap, revealing relationships and patterns among variables and observations.
9.

Rankings Plot: Visualizes rankings within the data, offering a clear representation of the order and magnitude of values.
10.

Dendrogram Plot: Plots a dendrogram of categories defined in a group by operation, illustrating hierarchical relationships within the dataset.

Benchmarking ehrapy

We generated a subset of the UKB data selecting 261 features and 488,170 patient visits. We removed all features with missingness rates greater than 70%. To demonstrate speed and memory consumption for various scenarios, we subsampled the data to 20%, 30% and 50%. We ran a minimal ehrapy analysis pipeline on each of those subsets and the full data, including the calculation of quality control metrics, filtering of variables by a missingness threshold, nearest neighbor imputation, normalization, dimensionality reduction and clustering (Supplementary Table 1). We conducted our benchmark on a single CPU with eight threads and 60 GB of maximum memory.

ehrapy further provides out-of-core implementations using Dask¹⁰⁸ for many algorithms in ehrapy, such as our normalization functions or our PCA implementation. Out-of-core computation refers to techniques that process data that do not fit entirely in memory, using disk storage to manage data overflow. This approach is crucial for handling large datasets without being constrained by system memory limits. Because the principal components get reused for other computationally expensive algorithms, such as the neighbors graph calculation, it effectively enables the analysis of very large datasets. We are currently working on supporting out-of-core computation for all computationally expensive algorithms in ehrapy.

We demonstrate the memory benefits in a hosted tutorial where the in-memory pipeline for 50,000 patients with 1,000 features required about 2 GB of memory, and the corresponding out-of-core implementation required less than 200 MB of memory.

The code for benchmarking is available at The implementation of ehrapy is accessible at together with extensive API documentation and tutorials at https://ehrapy.readthedocs.io.

PIC database analysis

Study design

We collected clinical data from the PIC⁴³ version 1.1.0 database. PIC is a single-center, bilingual (English and Chinese) database hosting information of children admitted to critical care units at the Children’s Hospital of Zhejiang University School of Medicine in China. The requirement for individual patient consent was waived because the study did not impact clinical care, and all protected health information was de-identified. The database contains 13,499 distinct hospital admissions of 12,881 distinct pediatric patients. These patients were admitted to five ICU units with 119 total critical care beds—GICU, PICU, SICU, CICU and NICU—between 2010 and 2018. The mean age of the patients was 2.5 years, of whom 42.5% were female. The in-hospital mortality was 7.1%; the mean hospital stay was 17.6 d; the mean ICU stay was 9.3 d; and 468 (3.6%) patients were admitted multiple times. Demographics, diagnoses, doctors’ notes, laboratory and microbiology tests, prescriptions, fluid balances, vital signs and radiographics reports were collected from all patients. For more details, see the original publication of Zeng et al.⁴³.

Study participants

Individuals older than 18 years were excluded from the study. We grouped the data into three distinct groups: ‘neonates’ (0–28 d of age; 2,968 patients), ‘infants’ (1–12 months of age; 4,876 patients) and ‘youths’ (13 months to 18 years of age; 6,097 patients). We primarily analyzed the ‘youths’ group with the discharge diagnosis ‘unspecified pneumonia’ (277 patients).

Data collection

The collected clinical data included demographics, laboratory and vital sign measurements, diagnoses, microbiology and medication information and mortality outcomes. The five-character English ICD-10 codes were used, whose values are based on the seven-character Chinese ICD-10 codes.

Dataset extraction and analysis

We downloaded the PIC database of version 1.1.0 from Physionet¹ to obtain 17 CSV tables. Using Pandas, we selected all information with more than 50% coverage rate, including demographics and laboratory and vital sign measurements (Fig. 2). To reduce the amount of noise, we calculated and added only the minimum, maximum and average of all measurements that had multiple values per patient. Examination reports were removed because they describe only diagnostics and not detailed findings. All further diagnoses and microbiology and medication information were included into the observations slot to ensure that the data were not used for the calculation of embeddings but were still available for the analysis. This ensured that any calculated embedding would not be divided into treated and untreated groups but, rather, solely based on phenotypic features. We imputed all missing data through k-nearest neighbors imputation (k = 20) using the knn_impute function of ehrapy. Next, we log normalized the data with ehrapy using the log_norm function. Afterwards, we winsorized the data using ehrapy’s winsorize function to obtain 277 ICU visits (n = 265 patients) with 572 features. Of those 572 features, 254 were stored in the matrix X and the remaining 318 in the ‘obs’ slot in the AnnData object. For clustering and visualization purposes, we calculated 50 principal components using ehrapy’s pca function. The obtained principal component representation was then used to calculate a nearest neighbors graph using the neighbors function of ehrapy. The nearest neighbors graph then served as the basis for a UMAP embedding calculation using ehrapy’s umap function.

Patient stratification

We applied the community detection algorithm Leiden with resolution 0.6 on the nearest neighbor graph using ehrapy’s leiden function. The four obtained clusters served as input for two-sided t-tests for all numerical values and two-sided g-tests for all categorical values for all four clusters against the union of all three other clusters, respectively. This was conducted using ehrapy’s rank_feature_groups function, which also corrects P values for multiple testing with the Benjamini–Hochberg method¹²⁵. We presented the four groups and the statistically significantly different features between the groups to two pediatricians who annotated the groups with labels.

Our determined groups can be confidently labeled owing to their distinct clinical profiles. Nevertheless, we could only take into account clinical features that were measured. Insightful features, such as lung function tests, are missing. Moreover, the feature representation of the time-series data is simplified, which can hide some nuances between the groups. Generally, deciding on a clustering resolution is difficult. However, more fine-grained clusters obtained via higher clustering resolutions may become too specific and not generalize well enough.

Kaplan–Meier survival analysis

We selected patients with up to 360 h of total stay for Kaplan–Meier survival analysis to ensure a sufficiently high number of participants. We proceeded with the AnnData object prepared as described in the ‘Patient stratification’ subsection to conduct Kaplan–Meier analysis among all four determined pneumonia groups using ehrapy’s kmf function. Significance was tested through ehrapy’s test_kmf_logrank function, which tests whether two Kaplan–Meier series are statistically significant, employing a chi-squared test statistic under the null hypothesis. Let h_i(t) be the hazard ratio of group i at time t and c a constant that represents a proportional change in the hazard ratio between the two groups, then:

$$\rmH_\rmo:\rmh_1(\rmt)=\rmh_2(\rmt)$$

$$\rmH_\rma:\rmh_1(\rmt)=\rmc* {\rmh}_2(\rmt),\rmc\ne 1$$

This implicitly uses the log-rank weights. An additional Kaplan–Meier analysis was conducted for all children jointly concerning the liver markers AST, ALT and GGT. To determine whether measurements were inside or outside the norm range, we used reference ranges (Supplementary Table 2). P values less than 0.05 were labeled significant.

Our Kaplan–Meier curve analysis depends on the groups being well defined and shares the same limitations as the patient stratification. Additionally, the analysis is sensitive to the reference table where we selected limits that generalize well for the age ranges, but, due to children of different ages being examined, they may not necessarily be perfectly accurate for all children.

Causal effect of mechanism of action on LOS

Although the dataset was not initially intended for investigating causal effects of interventions, we adapted it for this purpose by focusing on the LOS in the ICU, measured in months, as the outcome variable. This choice aligns with the clinical aim of stabilizing patients sufficiently for ICU discharge. We constructed a causal graph to explore how different drug administrations could potentially reduce the LOS. Based on consultations with clinicians, we included several biomarkers of liver damage (AST, ALT and GGT) and inflammation (CRP and PCT) in our model. Patient age was also considered a relevant variable.

Because several different medications act by the same mechanisms, we grouped specific medications by their drug classes This grouping was achieved by cross-referencing the drugs listed in the dataset with DrugBank release 5.1 (ref. ¹²⁶), using Levenshtein distances for partial string matching. After manual verification, we extracted the corresponding DrugBank categories, counted the number of features per category and compiled a list of commonly prescribed medications, as advised by clinicians. This approach facilitated the modeling of the causal graph depicted in Fig. 4, where an intervention is defined as the administration of at least one drug from a specified category.

Causal inference was then conducted with ehrapy’s ‘dowhy’⁶⁹-based causal inference module using the expert-curated causal graph. Medication groups were designated as causal interventions, and the LOS was the outcome of interest. Linear regression served as the estimation method for analyzing these causal effects. We excluded four patients from the analysis owing to their notably long hospital stays exceeding 90 d, which were deemed outliers. To validate the robustness of our causal estimates, we incorporated several refutation methods:

Placebo Treatment Refuter: This method involved replacing the treatment assignment with a placebo to test the effect of the treatment variable being null.
Random Common Cause: A randomly generated variable was added to the data to assess the sensitivity of the causal estimate to the inclusion of potential unmeasured confounders.
Data Subset Refuter: The stability of the causal estimate was tested across various random subsets of the data to ensure that the observed effects were not dependent on a specific subset.
Add Unobserved Common Cause: This approach tested the effect of an omitted variable by adding a theoretically relevant unobserved confounder to the model, evaluating how much an unmeasured variable could influence the causal relationship.
Dummy Outcome: Replaces the true outcome variable with a random variable. If the causal effect nullifies, it supports the validity of the original causal relationship, indicating that the outcome is not driven by random factors.
Bootstrap Validation: Employs bootstrapping to generate multiple samples from the dataset, testing the consistency of the causal effect across these samples.

The selection of these refuters addresses a broad spectrum of potential biases and model sensitivities, including unobserved confounders and data dependencies. This comprehensive approach ensures robust verification of the causal analysis. Each refuter provides an orthogonal perspective, targeting specific vulnerabilities in causal analysis, which strengthens the overall credibility of the findings.

UKB analysis

Study population

We used information from the UKB cohort, which includes 502,164 study participants from the general UK population without enrichment for specific diseases. The study involved the enrollment of individuals between 2006 and 2010 across 22 different assessment centers throughout the United Kingdom. The tracking of participants is still ongoing. Within the UKB dataset, metabolomics, proteomics and retinal optical coherence tomography data are available for a subset of individuals without any enrichment for specific diseases. Additionally, EHRs, questionnaire responses and other physical measures are available for almost everyone in the study. Furthermore, a variety of genotype information is available for nearly the entire cohort, including whole-genome sequencing, whole-exome sequencing, genotyping array data as well as imputed genotypes from the genotyping array⁴⁴. Because only the latter two are available for download, and are sufficient for polygenic risk score calculation as performed here, we used the imputed genotypes in the present study. Participants visited the assessment center up to four times for additional and repeat measurements and completed additional online follow-up questionnaires.

In the present study, we restricted the analyses to data obtained from the initial assessment, including the blood draw, for obtaining the metabolomics data and the retinal imaging as well as physical measures. This restricts the study population to 33,521 individuals for whom all of these modalities are available. We have a clear study start point for each individual with the date of their initial assessment center visit. The study population has a mean age of 57 years, is 54% female and is censored at age 69 years on average; 4.7% experienced an incident myocardial infarction; and 8.1% have prevalent type 2 diabetes. The study population comes from six of the 22 assessment centers due to the retinal imaging being performed only at those.

Data collection

For the myocardial infarction endpoint definition, we relied on the first occurrence data available in the UKB, which compiles the first date that each diagnosis was recorded for a participant in a hospital in ICD-10 nomenclature. Subsequently, we mapped these data to phecodes and focused on phecode 404.1 for myocardial infarction.

The Framingham Risk Score was developed on data from 8,491 participants in the Framingham Heart Study to assess general cardiovascular risk⁷⁷. It includes easily obtainable predictors and is, therefore, easily applicable in clinical practice, although newer and more specific risk scores exist and might be used more frequently. It includes age, sex, smoking behavior, blood pressure, total and low-density lipoprotein cholesterol as well as information on insulin, antihypertensive and cholesterol-lowering medications, all of which are routinely collected in the UKB and used in this study as the Framingham feature set.

The metabolomics data used in this study were obtained using proton NMR spectroscopy, a low-cost method with relatively low batch effects. It covers established clinical predictors, such as albumin and cholesterol, as well as a range of lipids, amino acids and carbohydrate-related metabolites.

The retinal optical coherence tomography–derived features were returned by researchers to the UKB^75,76. They used the available scans and determined the macular volume, macular thickness, retinal pigment epithelium thickness, disc diameter, cup-to-disk ratio across different regions as well as the thickness between the inner nuclear layer and external limiting membrane, inner and outer photoreceptor segments and the retinal pigment epithelium across different regions. Furthermore, they determined a wide range of quality metrics for each scan, including the image quality score, minimum motion correlation and inner limiting membrane (ILM) indicator.

Data analysis

After exporting the data from the UKB, all timepoints were transformed into participant age entries. Only participants without prevalent myocardial infarction (relative to the first assessment center visit at which all data were collected) were included.

The data were pre-processed for retinal imaging and metabolomics subsets separately, to enable a clear analysis of missing data and allow for the k-nearest neighbors–based imputation (k = 20) of missing values when less than 10% were missing for a given participant. Otherwise, participants were dropped from the analyses. The imputed genotypes and Framingham analyses were available for almost every participant and, therefore, not imputed. Individuals without them were, instead, dropped from the analyses. Because genetic risk modeling poses entirely different methodological and computational challenges, we applied a published polygenic risk score for coronary heart disease using 6.6 million variants⁷³. This was computed using the plink2 score option on the imputed genotypes available in the UKB.

UMAP embeddings were computed using default parameters on the full feature sets with ehrapy’s umap function. For all analyses, the same time-to-event and event-indicator columns were used. The event indicator is a Boolean variable indicating whether a myocardial infarction was observed for a study participant. The time to event is defined as the timespan between the start of the study, in this case the date of the first assessment center visit. Otherwise, it is the timespan from the start of the study to the start of censoring; in this case, this is set to the last date for which EHRs were available, unless a participant died, in which case the date of death is the start of censoring. Kaplan–Meier curves and Cox proportional hazards models were fit using ehrapy’s survival analysis module and the lifelines¹²⁴ package’s Cox-PHFitter function with default parameters. For Cox proportional hazards models with multiple feature sets, individually imputed and quality-controlled feature sets were concatenated, and the model was fit on the resulting matrix. Models were evaluated using the C-index¹²⁷ as a metric. It can be seen as an extension of the common area under the receiver operator characteristic score to time-to-event datasets, in which events are not observed for every sample and which ranges from 0.0 (entirely false) over 0.5 (random) to 1.0 (entirely correct). CIs for the C-index were computed based on bootstrapping by sampling 1,000 times with replacement from all computed partial hazards and computing the C-index over each of these samples. The percentiles at 2.5% and 97.5% then give the upper and lower confidence bound for the 95% CIs.

In all UKB analyses, the unit of study for a statistical test or predictive model is always an individual study participant.

The generalizability of the analysis is limited as the UK Biobank cohort may not represent the general population, with potential selection biases and underrepresentation of the different demographic groups. Additionally, by restricting analysis to initial assessment data and censoring based on the last available EHR or date of death, our analysis does not account for longitudinal changes and can introduce follow-up bias, especially if participants lost to follow-up have different risk profiles.

In-depth quality control of retina-derived features

A UMAP plot of the retina-derived features indicating the assessment centers shows a cluster of samples that lie somewhat outside the general population and mostly attended the Birmingham assessment center (Fig. 5b). To further investigate this, we performed Leiden clustering of resolution 0.3 (Extended Data Fig. 9a) and isolated this group in cluster 5. When comparing cluster 5 to the rest of the population in the retina-derived feature space, we noticed that many individuals in cluster 5 showed overall retinal pigment epithelium (RPE) thickness measures substantially elevated over the rest of the population in both eyes (Extended Data Fig. 9b), which is mostly a feature of this cluster (Extended Data Fig. 9c). To investigate potential confounding, we computed ratios between cluster 5 and the rest of the population over the ‘obs’ DataFrame containing the Framingham features, diabetes-related phecodes and genetic principal components. Out of the top and bottom five highest ratios observed, six are in genetic principal components, which are commonly used to represent genetic ancestry in a continuous space (Extended Data Fig. 9d). Additionally, diagnoses for type 1 and type 2 diabetes and antihypertensive use are enriched in cluster 5. Further investigating the ancestry, we computed log ratios for self-reported ancestries and absolute counts, which showed no robust enrichment and depletion effects.

A closer look at three quality control measures of the imaging pipeline revealed that cluster 5 was an outlier in terms of either image quality (Extended Data Fig. 9e) or minimum motion correlation (Extended Data Fig. 9f) and the ILM indicator (Extended Data Fig. 9g), all of which can be indicative of artifacts in image acquisition and downstream processing¹²⁸. Subsequently, we excluded 301 individuals from cluster 5 from all analyses.

COVID-19 chest-x-ray fate determination

Dataset overview

We used the public BrixIA COVID-19 dataset, which contains 192 chest x-ray images annotated with BrixIA scores⁸². Hereby, six regions were annotated by a senior radiologist with more than 20 years of experience and a junior radiologist with a disease severity score ranging from 0 to 3. A global score was determined as the sum of all of these regions and, therefore, ranges from 0 to 18 (S-Global). S-Global scores of 0 were classified as normal. Images that only had severity values up to 1 in all six regions were classified as mild. Images with severity values greater than or equal to 2, but a S-Global score of less than 7, were classified as moderate. All images that contained at least one 3 in any of the six regions with a S-Global score between 7 and 10 were classified as severe, and all remaining images with S-Global scores greater than 10 with at least one 3 were labeled critical. The dataset and instructions to download the images can be found at https://github.com/ieee8023/covid-chestxray-dataset.

Dataset extraction and analysis

We first resized all images to 224 × 224. Afterwards, the images underwent a random affine transformation that involved rotation, translation and scaling. The rotation angle was randomly selected from a range of −45° to 45°. The images were also subject to horizontal and vertical translation, with the maximum translation being 15% of the image size in either direction. Additionally, the images were scaled by a factor ranging from 0.85 to 1.15. The purpose of applying these transformations was to enhance the dataset and introduce variations, ultimately improving the robustness and generalization of the model.

To generate embeddings, we used a pre-trained DenseNet model with weights densenet121-res224-all of TorchXRayVision¹²⁹. A DenseNet is a convolutional neural network that makes use of dense connections between layers (Dense Blocks) where all layers (with matching feature map sizes) directly connect with each other. To maintain a feed-forward nature, every layer in the DenseNet architecture receives supplementary inputs from all preceding layers and transmits its own feature maps to all subsequent layers. The model was trained on the nih-pc-chex-mimic_ch-google-openi-rsna dataset¹³⁰.

Next, we calculated 50 principal components on the feature representation of the DenseNet model of all images using ehrapy’s pca function. The principal component representation served as input for a nearest neighbors graph calculation using ehrapy’s neighbors function. This graph served as the basis for the calculation of a UMAP embedding with three components that was finally visualized using ehrapy.

We randomly picked a root in the group of images that was labeled ‘Normal’. First, we calculated so-called pseudotime by fitting a trajectory through the calculated UMAP space using diffusion maps as implemented in ehrapy’s dpt function⁵⁷. Each image’s pseudotime value represents its estimated position along this trajectory, serving as a proxy for its severity stage relative to others in the dataset. To determine fates, we employed CellRank^58,59 with the PseudotimeKernel. This kernel computes transition probabilities for patient visits based on the connectivity of the k-nearest neighbors graph and the pseudotime values of patient visits, which resembles their progression through a process. Directionality is infused in the nearest neighbors graph in this process where the kernel either removes or downweights edges in the graph that contradict the directional flow of increasing pseudotime, thereby refining the graph to better reflect the developmental trajectory. We computed the transition matrix with a soft threshold scheme (Parameter of the PseudotimeKernel), which downweights edges that point against the direction of increasing pseudotime. Finally, we calculated a projection on top of the UMAP embedding with CellRank using the plot_projection function of the PseudotimeKernel that we subsequently plotted.

This analysis is limited by the small dataset of 192 chest x-ray images, which may affect the model’s generalizability and robustness. Annotation subjectivity from radiologists can further introduce variability in severity scores. Additionally, the random selection of a root from ‘Normal’ images can introduce bias in pseudotime calculations and subsequent analyses.

Diabetes 130-US hospitals analysis

Study population

We used data from the Diabetes 130-US hospitals dataset that were collected between 1999 and 2008. It contains clinical care information at 130 hospitals and integrated delivery networks. The extracted database information pertains to hospital admissions specifically for patients diagnosed with diabetes. These encounters required a hospital stay ranging from 1 d to 14 d, during which both laboratory tests and medications were administered. The selection criteria focused exclusively on inpatient encounters with these defined characteristics. More specifically, we used a version that was curated by the Fairlearn team where the target variable ‘readmitted’ was binarized and a few features renamed or binned ( The dataset contains 101,877 patient visits and 25 features. The dataset predominantly consists of White patients (74.8%), followed by African Americans (18.9%), with other racial groups, such as Hispanic, Asian and Unknown categories, comprising smaller percentages. Females make up a slight majority in the data at 53.8%, with males accounting for 46.2% and a negligible number of entries listed as unknown or invalid. A substantial majority of the patients are over 60 years of age (67.4%), whereas those aged 30–60 years represent 30.2%, and those 30 years or younger constitute just 2.5%.

Data analysis

All of the following descriptions start by loading the Fairlearn version of the Diabetes 130-US hospitals dataset using ehrapy’s dataloader as an AnnData object.

Selection and filtering bias

An overview of sensitive variables was generated using tableone. Subsequently, ehrapy’s CohortTracker was used to track the age, gender and race variables. The cohort was filtered for all Medicare recipients and subsequently plotted.

Surveillance bias

We plotted the HbA1c measurement ratios using ehrapy’s catplot.

Missing data and imputation bias

MCAR-type missing data for the number of medications variable (‘num_medications‘) were introduced by randomly setting 30% of the variables to be missing using Numpy’s choice function. We tested that the data are MCAR by applying ehrapy’s implementation of Little’s MCAR test, which returned a non-significant P value of 0.71. MAR data for the number of medications variable (‘num_medications‘) were introduced by scaling the ‘time_in_hospital’ variable to have a mean of 0 and a standard deviation of 1, adjusting these values by multiplying by 1.2 and subtracting 0.6 to influence overall missingness rate, and then using these values to generate MAR data in the ‘num_medications’ variable via a logistic transformation and binomial sampling. We verified that the newly introduced missing values are not MCAR with respect to the ‘time_in_hospital’ variable by applying ehrapy’s implementation of Little’s test, which was significant (0.01 × 10⁻²). The missing data were imputed using ehrapy’s mean imputation and MissForest implementation.

Algorithmic bias

Variables ‘race’, ‘gender’, ‘age’, ‘readmitted’, ‘readmit_binary’ and ‘discharge_disposition_id’ were moved to the ‘obs’ slot of the AnnData object to ensure that they were not used for model training. We built a binary label ‘readmit_30_days’ indicating whether a patient had been readmitted in fewer than 30 d. Next, we combined the ‘Asian’ and ‘Hispanic’ categories into a single ‘Other’ category within the ‘race’ column of our AnnData object and then filtered out and discarded any samples labeled as ‘Unknown/Invalid’ under the ‘gender‘ column and subsequently moved the ‘gender’ data to the variable matrix X of the AnnData object. All categorical variables got encoded. The data were split into train and test groups with a test size of 50%. The data were scaled, and a logistic regression model was trained using scikit-learn, which was also used to determine the balanced accuracy score. Fairlearn’s MetricFrame function was used to inspect the target model performance against the sensitive variable ‘race’. We subsequently fit Fairlearn’s ThresholdOptimizer using the logistic regression estimator with balanced_accuracy_score as the target object. The algorithmic demonstration of Fairlearn’s abilities on this dataset is shown here:

Normalization bias

We one-hot encoded all categorical variables with ehrapy using the encode function. We applied ehrapy’s implementation of scaling normalization with and without the ‘Age group’ variable as group key to scale the data jointly and separately using ehrapy’s scale_norm function.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

link

Design and implementation of ehrapy

Data preparation

Dataloaders

AnnData for EHR data

Feature annotation

Data processing

Cohort tracking

Basic pre-processing and quality control

Embeddings

Clustering

Feature ranking

Dataset integration

Knowledge inference

Survival analysis

Trajectory inference

Causal inference

Patient stratification

Visualization

Benchmarking ehrapy

PIC database analysis

Study design

Study participants

Data collection

Dataset extraction and analysis

Patient stratification

Kaplan–Meier survival analysis

Causal effect of mechanism of action on LOS

UKB analysis

Study population

Data collection

Data analysis

In-depth quality control of retina-derived features

COVID-19 chest-x-ray fate determination

Dataset overview

Dataset extraction and analysis

Diabetes 130-US hospitals analysis

Study population

Data analysis

Selection and filtering bias

Surveillance bias

Missing data and imputation bias

Algorithmic bias

Normalization bias

Reporting summary

More Stories

Systematic analysis of healthcare big data analytics for efficient care and disease diagnosing

Using predictive analytics in health care

LexisNexis Risk Solutions Launches Research-Ready Healthcare Data Blends to Expedite Research |

Leave a Reply Cancel reply

You may have missed

Shaping the future of healthcare analytics: Nidhi Shashikumar’s data-driven Impact

Healthcare Data Analytics Market Size Analysis by Application,

NEC Orchestrating Future Fund invests in Aetion, provider of healthcare analytics: Press Releases

A ‘transformative experience’ building data skills and lasting relationships