Data source

The database for Aggregate Analysis of ClinicalTrials.gov (AACT) is a publicly available relational database enhanced by the Clinical Trials Transformation Initiative (CTTI) that contains both protocol and result elements for all studies recorded in ClinicalTrials.gov12. On 6 July 2022, the AACT database held 420,268 clinical studies registered from 1999 to July 2022; this version was extracted from the database in comma-separated values (CSV) format for this research.

Data preparation

Data preparation and analysis form a significant part of our proposed pipeline. The raw dataset contains three study types: “Interventional”, “Observational” and “Patient Registry”. Only “Interventional” studies, which make up the largest proportion of the three study types, were retained from the raw data.

Clinical trial success can be defined in two ways: the success of the intervention (whether the intervention achieved its objective/s) or the successful completion of the trial. For the scope of this research, we use the latter definition. There are 14 study status types recorded for this dataset, including recruiting, completed, withdrawn and unknown statuses. The proposed pipeline aims to predict the probability that a study protocol leads to termination and, if so, to interpret this prediction in order to flag the contributing features. Hence, studies that are completed, terminated or withdrawn were extracted from the original dataset. Our supervised machine learning algorithms learn and output a simplification of these three statuses as a binary outcome of success or failure.

Missing or erroneous data is common in real world big data sources. The objective of clinical study registries is to provide complete, accurate and timely recorded trial data. Although the emphasis on registering clinical studies and providing quality data has increased over time (since 1999), there are still a high number of studies that have a significant number of missing data points or substantial errors13. Figure 2 shows a bar plot of the average missing value rates from 1999, when the first study was registered to this registry, until 2022, considering the 24 study design features investigated in this initial analysis. The average missing value rate is calculated as the sum of missing values divided by the total number of studies recorded that year. There is a significant decrease in the average missing value rates over the years, especially from 1999 to 2008. After 2011, the average missing value rate is stable below 10. Hence, the studies registered before 2011 were removed from the dataset, leaving 112,647 studies.
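The per-year rate described above can be sketched as follows. This is a minimal illustration using pandas; the column names (`registration_year`, `number_of_sites`, `masking`) and the toy rows are assumptions made for the example, not the authors' actual schema or data.

```python
import pandas as pd

def missing_rate_per_year(df, year_col, feature_cols):
    """Sum of missing cells over feature_cols divided by the number of studies, per year."""
    missing_per_study = df[feature_cols].isna().sum(axis=1)
    grouped = missing_per_study.groupby(df[year_col])
    return grouped.sum() / grouped.size()

# Hypothetical toy data: two studies registered in 1999, two in 2015.
studies = pd.DataFrame({
    "registration_year": [1999, 1999, 2015, 2015],
    "number_of_sites": [None, 3.0, 10.0, 2.0],
    "masking": [None, None, "Double", "Single"],
})

rates = missing_rate_per_year(
    studies, "registration_year", ["number_of_sites", "masking"]
)
print(rates)
```

Because the numerator counts missing cells across all features, this quantity can exceed 1 for a given year; dividing additionally by the number of features would give a per-cell fraction instead.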

Figure 2

Average missing value rates (proportion of sum of missing values over total number of studies) per year from 1999 to 2022 considering the 24 study characteristics features used in this study.

Studies by phases

Table 1 gives a summary of the number of studies and their recorded phases in the dataset. For some studies, more than one phase can be recorded. In such cases, both phases are considered correct when generating the phase-specific subsets. For example, if a study phase is recorded as “Phase2/Phase3”, the study is included in both the Phase 2 and Phase 3 subsets.
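The multi-phase rule can be illustrated with a short sketch. It assumes that combined phases are separated by a slash, as in the “Phase2/Phase3” example; the study identifiers are made up for illustration.

```python
from collections import defaultdict

def phase_subsets(studies):
    """Map each phase label to the ids of all studies recorded under it."""
    subsets = defaultdict(list)
    for study_id, phase in studies:
        # A combined label such as "Phase2/Phase3" contributes to both subsets.
        for part in phase.split("/"):
            subsets[part.strip()].append(study_id)
    return dict(subsets)

# Hypothetical (study id, recorded phase) pairs.
example = [("NCT-A", "Phase2/Phase3"), ("NCT-B", "Phase3"), ("NCT-C", "Phase1")]
subsets = phase_subsets(example)
print(subsets)
```

One consequence of this rule is that the phase-specific subsets overlap, so their sizes sum to more than the number of distinct studies.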

Table 1 Study phases and number of studies recorded under each phase on ClinicalTrials.gov.

Study characteristics features

Study characteristics features include logistic, administrative and design features of clinical trials. This section discusses some of the important features selected for the final feature set in more detail. The full list of numerical and categorical features is included in the supplementary materials.

Out of 190,678 studies, 19,252 did not record the number of sites. Although the recent growth in decentralised trials emphasises that sites are not always needed, most of the historical studies do not belong to this category of trials14. The number of clinical sites has an impact on trial enrolment and patient demographics, as the clinical study is limited to participants who live near the defined sites and can attend study visits. Therefore, we included the number of sites in the final feature set.

Defining the primary and secondary outcomes is an essential part of any interventional clinical trial15. The primary outcome measures directly form part of the study hypothesis. The number of primary and secondary outcomes to measure are included in our final feature set as two separate features.

A set of features specific to interventional studies, such as randomisation, intervention model, intervention type, masking and FDA regulation as a binary feature, are added to the final dataset.

Disease category features

Figure 3 illustrates the proportion of completed to failed studies by disease category. Recorded conditions and MeSH terms are combined to search for the diseases recorded under each disease category. This categorization of specific conditions is the same as that used in the database. As illustrated in Fig. 3, studies under the neoplasms and the blood and lymph conditions categories are the most likely to fail, whereas studies under occupational diseases and disorders of environmental origin are the least likely to fail. The implementation of this categorization allows one study to be recorded under multiple disease categories.

Figure 3

Percentage of completed to terminated studies for each disease category recorded on ClinicalTrials.gov.

Eligibility criteria statistical and search features

Eligibility criteria is a free-text column in the raw dataset which includes inclusion and exclusion criteria specified in the study design. Eligibility criteria are implemented to control who can participate in clinical studies. Acceptance of healthy volunteers, and acceptance of patients by gender and age are among the features added, followed by number of inclusion and exclusion criteria, as well as total and average number of words for eligibility criteria per study. 54,758 studies that accepted healthy volunteers had a 7% failure rate, whereas 134,842 studies that did not accept healthy volunteers had a 17% failure rate. The importance of inclusive eligibility criteria has been emphasised increasingly over the years, as exclusion of particular subgroups makes it harder for studies to recruit patients and deliver inclusive outcomes16.
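The descriptive eligibility features can be sketched as below. This is an illustration only: it assumes the free text uses “Inclusion Criteria:” and “Exclusion Criteria:” headers with one criterion per line, which may not hold for every record, and the function and key names are hypothetical.

```python
import re

def eligibility_stats(text):
    """Count inclusion/exclusion criteria and words in a free-text eligibility field."""
    # Assumed layout: an "Exclusion Criteria" header separates the two blocks.
    parts = re.split(r"exclusion criteria:?", text, maxsplit=1, flags=re.IGNORECASE)
    inclusion = re.sub(r"inclusion criteria:?", "", parts[0], flags=re.IGNORECASE)
    exclusion = parts[1] if len(parts) > 1 else ""

    def items(block):
        # One criterion per non-empty line, with bullet characters stripped.
        return [ln.strip(" -*\t") for ln in block.splitlines() if ln.strip(" -*\t")]

    inc, exc = items(inclusion), items(exclusion)
    total_words = sum(len(c.split()) for c in inc + exc)
    n_criteria = len(inc) + len(exc)
    return {
        "n_inclusion": len(inc),
        "n_exclusion": len(exc),
        "total_words": total_words,
        "avg_words": total_words / n_criteria if n_criteria else 0.0,
    }

sample = """Inclusion Criteria:
- Age 18 years or older
- Diagnosed with type 2 diabetes
Exclusion Criteria:
- Pregnant
"""
stats = eligibility_stats(sample)
print(stats)
```

Real eligibility text is far less regular than this sample, so a production version would need more robust splitting rules.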

In addition to basic descriptive features generated from the eligibility criteria, our research introduces a set of more complex eligibility criteria search features generated using the public CHIA dataset by Kury et al.11. It is a large, annotated corpus of patient eligibility criteria extracted from 1,000 Phase IV studies registered in ClinicalTrials.gov. Annotating and generating search terms from the free-text eligibility criteria column in the original dataset would be a slow, heavily manual process with a massive output of search terms. Hence, we propose a more efficient way of generating search terms. The CHIA dataset contains 12,864 inclusion and exclusion criteria annotated with their entity category and value. We use the following category types in CHIA to generate our search terms: “Condition”, “Procedure”, “Person”, “Temporal”, “Drug”, “Observation”, “Mood”, “Visit”.

Category and entity pairs are generated for inclusion and exclusion criteria separately. 12,864 entity category and value pairs are generated as search features. The eligibility free-text field in our dataset is separated into two fields, inclusion and exclusion, and the generated search pairs are used to search the inclusion and exclusion fields of our original dataset. For computational efficiency, we restricted search terms to those with 5 words or fewer and then searched for these in the original dataset using a 5-gram language model. This process generated a sparse binary dataset of 12,864 features, which was concatenated to our original features.
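One plausible reading of the search step is exact matching of each term against all word n-grams (up to five words) of a criteria field. The sketch below follows that reading; the tokenisation rule, the function names and the example terms are assumptions, and only the entity value (not its category) is matched here.

```python
import re

def normalise(text):
    """Lower-case and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, max_n=5):
    """All contiguous word n-grams of length 1..max_n, joined by spaces."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

def search_features(criteria_text, search_terms, max_n=5):
    """Binary feature per search term: 1 if the term occurs in the text, else 0."""
    grams = ngrams(normalise(criteria_text), max_n)
    features = {}
    for term in search_terms:
        tokens = normalise(term)
        if 0 < len(tokens) <= max_n:  # terms longer than max_n words are dropped
            features[term] = int(" ".join(tokens) in grams)
    return features

exclusion_text = "Pregnant or breastfeeding women. History of Type 2 diabetes."
terms = [
    "type 2 diabetes",
    "pregnant",
    "renal failure",
    "history of severe allergic reaction to the study drug",  # 9 words: dropped
]
features = search_features(exclusion_text, terms)
print(features)
```

Running the same function once on the inclusion field and once on the exclusion field yields the two sets of sparse binary columns.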

Data labelling

Overall status is recorded for every clinical trial in the AACT database. If no participants were enrolled in the trial, the status of that trial is ‘Withdrawn’, and if a trial was stopped prematurely, the status of that trial is ‘Terminated’. Out of 28,098 terminated or withdrawn studies that reported a reason for stopping the study, 9,260 studies stopped prematurely due to reasons related to participant recruitment and enrolment. Trials that are successfully completed have the status ‘Completed’. For supervised machine learning model training, the target label of each study is derived from its overall status: terminated and withdrawn studies are labelled as ‘failure’ and completed studies are labelled as ‘success’. The terms “class 0” and “failure class”, and likewise “class 1” and “success class”, are used interchangeably.
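The filtering and labelling rules can be sketched with pandas as follows. The column names `study_type` and `overall_status` and the toy rows are assumptions for illustration; the mapping itself follows the text (completed is class 1, terminated and withdrawn are class 0).

```python
import pandas as pd

# Label mapping taken from the text: success = class 1, failure = class 0.
STATUS_TO_LABEL = {"Completed": 1, "Terminated": 0, "Withdrawn": 0}

def label_studies(df):
    """Keep interventional studies with a final status and attach a binary label."""
    keep = (df["study_type"] == "Interventional") & df["overall_status"].isin(
        list(STATUS_TO_LABEL)
    )
    out = df.loc[keep].copy()
    out["label"] = out["overall_status"].map(STATUS_TO_LABEL)
    return out

# Hypothetical toy rows.
raw = pd.DataFrame({
    "study_type": ["Interventional", "Interventional", "Observational",
                   "Interventional", "Interventional"],
    "overall_status": ["Completed", "Terminated", "Completed",
                       "Recruiting", "Withdrawn"],
})
labelled = label_studies(raw)
print(labelled)
```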

Numerical and categorical feature encoding

The final feature set is a mixture of numerical and categorical columns, which require different methods of encoding. Large public datasets come with a lot of missing and erroneous data. Particularly for numerical features, the handling of missing values can have a big impact on predictive model performance. The Multiple Imputation by Chained Equations (MICE) algorithm was selected to handle numerical missing data, as it is a robust and informative method17. Missing cells are imputed through an iterative sequence of predictive models where, in each iteration, one of the features is imputed using the other features of the dataset. This algorithm runs until it converges, and all missing numerical feature values are imputed in this process.
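As an illustration of chained-equation imputation, the sketch below uses scikit-learn's `IterativeImputer`, which is inspired by MICE but returns a single imputed dataset rather than multiple imputations. Whether the authors used this particular implementation is not stated, so treat the choice of library, the parameters and the toy matrix as assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy numerical matrix with two missing cells (np.nan).
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

# Each feature with missing values is modelled from the other features, round-robin,
# for up to max_iter rounds or until the stopping tolerance is reached.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Observed cells are left unchanged; only the missing cells are filled in.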

One hot encoding is an effective method to encode categorical features. This method generates new binary features for each sub-category of a categorical feature. The method handles missing categorical values by encoding them as zeros.
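A minimal sketch of this encoding with pandas is shown below; the `masking` column and its values are illustrative. With the default settings of `pd.get_dummies`, a missing category produces zeros in every generated column, which matches the behaviour described above.

```python
import pandas as pd

# Hypothetical categorical feature with one missing value.
df = pd.DataFrame({"masking": ["Double", "Single", None, "Double"]})

# One binary column per observed sub-category; the missing row is all zeros.
encoded = pd.get_dummies(df, columns=["masking"], dtype=int)
print(encoded)
```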

Train/test datasets

Phase-specific datasets for Phase 1, Phase 2 and Phase 3 studies were generated for training different models. In order to estimate the performance of our machine learning models, the train-test split method was used18. For the final model, a 70:30 train to test split ratio was selected. The train set is used to train the models, whereas the test set is held aside for the final evaluation of the model. This is an effective and fast approach to test our trained models with data they have never seen before.
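A 70:30 split can be sketched with scikit-learn as below. The synthetic data, the fixed seed and the use of stratification (keeping the class ratio similar in both parts) are assumptions for the example; the text does not state whether the split was stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 studies, 2 features, 15 failures (0) and 85 successes (1).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 15 + [1] * 85)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```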

Handling data imbalance

Data imbalance is one of the main challenges of using a clinical trials dataset for termination classification. The ratio of failure (negative) to success (positive) samples for the overall dataset, which contains studies from all phases, is 15:85. Hence, classification would be biased towards the positive class if the imbalance is not handled. This can result in a misleadingly high model accuracy. Therefore, random under-sampling is applied to the training set: the necessary number of data points is deleted from the positive class subset to reach the defined positive/negative ratio. We use a 1:1 ratio between the negative and positive classes. Random under-sampling was applied only to the training samples after the train-test split. Hence, the test set remained imbalanced to preserve a realistic test distribution.
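Random under-sampling to a 1:1 ratio can be sketched with NumPy alone, as below. The function name, the seed and the toy training set are assumptions; the function is meant to be applied to the training portion only, leaving the test set untouched.

```python
import numpy as np

def random_undersample(X, y, random_state=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Hypothetical training set: 10 failures (0) and 60 successes (1).
X_train = np.arange(140).reshape(70, 2)
y_train = np.array([0] * 10 + [1] * 60)

X_bal, y_bal = random_undersample(X_train, y_train)
print(np.bincount(y_bal))
```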

Top feature selection

The feature set size increased significantly due to the addition of eligibility criteria features. In order to achieve the best performance without generating unnecessary noise in the data, feature selection was applied. An ablation study was done to understand the effect of adding more features on model performance. The number of features was plotted against model error in order to find an elbow point. The elbow point is where the slope of the error curve flattens markedly, indicating that adding more features no longer has a significant effect on performance. Once the optimal number of features for training was determined with this method, we selected the k features (k being the number of features needed) with the highest scores19. We used the Analysis of Variance (ANOVA) F score as the scoring function20.
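Selecting the k highest-scoring features by ANOVA F score can be sketched with scikit-learn's `SelectKBest` and `f_classif`. The synthetic data below, with one informative column among four noise columns, is an assumption made purely to show the mechanics.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.array([0, 1] * 50)

# Column 2 tracks the label closely; the other four columns are pure noise.
informative = y + rng.normal(scale=0.1, size=100)
noise = rng.normal(size=(100, 4))
X = np.column_stack([noise[:, :2], informative, noise[:, 2:]])

selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = selector.get_support(indices=True)
print(selected, selector.scores_)
```

In an ablation such as the one described, this fit would be repeated for increasing values of k and the resulting model error recorded for each.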

Machine learning model selection

Logistic regression, random forest and extreme gradient boosting (XGBoost) classifiers were trained and evaluated. The logistic regression classifier is a simpler algorithm compared to the tree-based ensemble models, such as random forest and extreme gradient boosting21,22. Although feature selection is applied, the final datasets are still large and sparse. This ruled out many machine learning architectures.

Model evaluation

Particularly in imbalanced datasets, splitting the dataset into train and test sets drastically decreases the number of samples used for learning. Hence, fivefold cross validation is used for model evaluation to achieve unbiased metric scores. The dataset is split into 5 smaller sets and the model is trained 5 times, each time using a different set as the test dataset. The performance of the model is reported as the average of the 5 experiments. This provided reliable metric scores to evaluate different models.
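Fivefold cross validation can be sketched with scikit-learn as below. The synthetic imbalanced data, the logistic regression model, the stratified folds and the ROC AUC metric are all assumptions chosen for illustration, not a statement of the exact configuration used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with roughly a 15:85 class ratio.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.15, 0.85], random_state=0
)

# Five folds; each fold serves once as the held-out set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)
mean_score = scores.mean()
print(scores, mean_score)
```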

Model hyperparameter tuning

Tree-based models require careful hyperparameter tuning; however, it is computationally expensive to test every combination of parameters to achieve the best results. Therefore, a strategy was devised to find the best possible parameters for the models in hand. The first step is to control model complexity in order to prevent overfitting. Maximum depth of each tree and minimum sum of instance weight needed in each child are the two parameters optimised for this purpose. Increasing the maximum depth increases the complexity as well as the risk of overfitting, whereas increasing the minimum child weight makes the model more conservative. The second step is to add randomness to make training robust to noise23. The subsampling ratio of training instances and the subsampling ratio of columns during construction of each tree are optimised. Optimal parameters were chosen after several iterations following this strategy.
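The two-step strategy can be sketched as a staged grid search that tunes one parameter group at a time instead of the full Cartesian product. The parameter names follow XGBoost's conventions (`max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`); the grid values, the starting point and the `toy_score` function are hypothetical stand-ins, with `toy_score` taking the place of a real cross-validated metric.

```python
from itertools import product

def staged_search(evaluate, stages, base_params):
    """Tune one group of parameters at a time, carrying the best values forward."""
    best = dict(base_params)
    for grid in stages:
        names = list(grid)
        candidates = [dict(zip(names, combo)) for combo in product(*grid.values())]
        best_combo = max(candidates, key=lambda c: evaluate({**best, **c}))
        best.update(best_combo)
    return best

# Stage 1 controls model complexity; stage 2 adds randomness (values are illustrative).
stages = [
    {"max_depth": [3, 5, 7], "min_child_weight": [1, 3, 5]},
    {"subsample": [0.6, 0.8, 1.0], "colsample_bytree": [0.6, 0.8, 1.0]},
]
base = {"max_depth": 3, "min_child_weight": 1, "subsample": 1.0, "colsample_bytree": 1.0}

def toy_score(params):
    # Stand-in for a cross-validated score; a real run would train and evaluate a model.
    return -(
        abs(params["max_depth"] - 5)
        + abs(params["min_child_weight"] - 3)
        + abs(params["subsample"] - 0.8)
        + abs(params["colsample_bytree"] - 0.8)
    )

best = staged_search(toy_score, stages, base)
print(best)
```

This evaluates 9 + 9 candidates rather than the 81 of the full grid, at the cost of possibly missing interactions between the two groups.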

Model interpretations using Shapley Additive exPlanations

SHAP (SHapley Additive exPlanations) is a framework based on Shapley values, a game theory approach24. This method is used to produce visual outputs that explain model predictions25. SHAP locally explains the feature contributions to individual predictions by connecting optimal credit allocation to local explanations using Shapley values. A base value and an output value are calculated for each plot. The base value is the average model output over the training data, and the output value is the base value plus the sum of the Shapley values of all features for that instance. This allows us to explain the influence of each feature on the prediction.
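The relationship between base value, Shapley values and output value can be checked on a toy model by computing exact Shapley values by brute force. This sketch does not use the SHAP library, which relies on much faster model-specific algorithms; the toy model, the background rows and the convention of replacing absent features with background values are assumptions for illustration.

```python
from itertools import permutations

def shapley_values(model, instance, background):
    """Exact Shapley values; absent features are filled in from background rows."""
    n = len(instance)

    def value(coalition):
        # Average model output with coalition features fixed to the instance's values.
        total = 0.0
        for row in background:
            x = [instance[i] if i in coalition else row[i] for i in range(n)]
            total += model(x)
        return total / len(background)

    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        coalition = set()
        for i in order:
            before = value(coalition)
            coalition.add(i)
            phi[i] += value(coalition) - before  # marginal contribution of feature i
    base = value(set())  # average model output over the background data
    return base, [p / len(orders) for p in phi]

def toy_model(x):
    return 2 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]
background = [[0.0, 0.0, 0.0], [2.0, 2.0, 1.0]]

base, phi = shapley_values(toy_model, instance, background)
output = base + sum(phi)
print(base, phi, output)
```

The brute-force loop visits every feature ordering, so it is only feasible for a handful of features; it is shown here solely to make the additive decomposition concrete.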

