Optimized machine learning mechanism for big data healthcare system to predict disease risk factor

The proposed DRFBPS model is validated in a Python environment on Windows 10. The collected datasets undergo preprocessing to remove errors, followed by feature selection to retain the features needed for accurate prediction; the selection is driven by the Red Fox fitness function. The disease is then predicted from the selected features. The computational time of the suggested DRFBPS is also a major concern for healthcare applications: training completed in an efficient 150 s, and memory usage was measured at 209.87 MB, reflecting effective resource use. For inference, the DRFBPS model runs with the optimized feature subset to achieve faster prediction with little computational overhead. The metrics for the proposed DRFBPS are summarized in Table 1.
Case study
Initially, the healthcare dataset was gathered from the Kaggle repository: the Heart Attack Risk Prediction (HARP) dataset. It contains 88,414 records (18.43 MB), of which 46,944 are low-risk and 41,470 are high-risk samples. The data is split 70% for training (32,861 low-risk and 29,029 high-risk instances) and 30% for testing (14,083 low-risk and 12,441 high-risk instances). The key attributes gathered from the healthcare data include age, gender, blood pressure readings (systolic and diastolic), body temperature, and cholesterol levels (LDL, HDL, and total). Table 2 illustrates the risk characteristics and the encoded value for each.
Table 2 lists the risk features for heart disease with their labeled codes for the two stages, low risk (0) and high risk (1), the two states the model is intended to distinguish.
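The 70/30 stratified split described above can be reproduced without any ML library; the sketch below uses only the class counts reported for the HARP dataset (the `stratified_split` helper and the seed are illustrative, not part of the original pipeline):

```python
import random

def stratified_split(labels, test_frac=0.30, seed=42):
    """Split indices into train/test while preserving the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test

# HARP class balance from the text: 46,944 low-risk (0), 41,470 high-risk (1).
labels = [0] * 46944 + [1] * 41470
train, test = stratified_split(labels)
print(len(train), len(test))  # 61890 26524 - matches the reported 70/30 split
```

With these counts the per-class allocation reproduces the paper's figures exactly: 32,861 low-risk and 29,029 high-risk instances in training, 14,083 and 12,441 in testing.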
The Shapley Additive Explanations (SHAP) plot in Fig. 5 provides insight into the impact of different features on the model output. The top bar plot shows the mean SHAP values, where the features Heart, LDL Cholesterol (LDL Chol), and Age have the highest impact on model predictions. The bottom summary plot shows the direction and magnitude of feature contributions, where each point represents an instance in the data, with color coding (blue for low values and pink for high values). Features like Heart and LDL Cholesterol have high impacts on predictions, with high values (pink) contributing positively to the model output, while features like HDL Cholesterol and Body Temperature have relatively lower impacts. The SHAP values also reveal feature interactions and their effect on prediction variability, making SHAP an appropriate tool for explainable AI in medicine and medical diagnostics.
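SHAP attributions are, at bottom, Shapley values from cooperative game theory: each feature's attribution is its average marginal contribution to the prediction over all subsets of the other features, with missing features filled from a baseline. A minimal exact computation for a toy two-feature risk score (the model `f`, its weights, and the baseline are hypothetical, purely to illustrate the additive attribution; the real plots come from the SHAP library applied to the trained model):

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values: for each feature i, average its marginal
    contribution to f over all subsets of the remaining features.
    Features outside the subset are filled from `baseline`."""
    n = len(x)
    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += w * (value(set(S) | {i}) - value(set(S)))
        phis.append(phi)
    return phis

# Toy risk score: heavy weight on LDL, lighter on age (illustrative only).
f = lambda z: 0.6 * z[0] + 0.2 * z[1]  # z = [LDL, Age], standardized
print([round(v, 6) for v in shapley_values(f, [2.0, 1.0], [0.0, 0.0])])  # [1.2, 0.2]
```

For a linear model the Shapley value of each feature reduces to its weight times its deviation from the baseline, which is why the printed attributions match the weighted inputs here.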

The feature importance heat map in Fig. 6 displays the correlation matrix of the features, with color intensity showing the strength and direction of associations. A value close to 1 (red) indicates a strong positive correlation, and a value close to −1 (blue) a strong negative correlation. Features such as heart rate (Hea) and systolic BP (Sys) are strongly positively correlated, indicating that rising systolic pressure accompanies heart-related conditions. Similarly, diastolic BP (Dia) and age are strongly positively correlated, implying that blood pressure rises with age. Gender (Gen) and age are strongly negatively correlated, suggesting gender-based patterns, while LDL Cholesterol (LDL) and HDL Cholesterol (HDL) are weakly negatively correlated. The heatmap is useful for interpreting feature interactions, which matter in predictive modeling, and for identifying the features with the greatest influence on results. Figures 7 and 8 show the accuracy and loss curves acquired during the training and testing phases; these curves visually represent the model’s accuracy and loss across epochs and show its capacity to distinguish normal from risky heart conditions.
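The entries of such a heatmap are plain Pearson correlation coefficients; a stdlib-only sketch on illustrative vitals (the values below are made up, not drawn from the HARP data):

```python
from statistics import mean, pstdev

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of
    standard deviations, always in [-1, 1]."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (pstdev(a) * pstdev(b))

# Toy vitals for five patients (illustrative only).
sys_bp = [118, 135, 142, 150, 128]
heart  = [68, 80, 88, 95, 74]
print(round(pearson(sys_bp, heart), 2))  # 0.99 - strongly positive, as in Fig. 6
```

Computing this coefficient for every feature pair yields the matrix that the heatmap color-codes.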

Feature importance heat map.

Training and testing accuracy curve.

Training and testing loss curve.
The loss curve represents the model’s error across training and testing epochs. A training loss that starts high and falls steadily indicates that predictions are becoming more accurate, while a testing loss that decreases in parallel indicates that the model generalizes well rather than overfitting.
Figure 9 shows the confusion matrix of the classification model’s performance, displaying true labels against predicted labels. The model correctly classified 14,081 instances as class 0 (low risk) and 12,441 instances as class 1 (high risk), with merely two instances of misclassification. This implies that the model performs exceptionally well with very few errors. To verify the importance of Red Fox Optimization-based feature selection, the results are evaluated and compared before and after feature selection, as displayed in Fig. 10.


Performance before and after feature selection.
The p-value attained before feature selection is 0.07, and after feature selection it is 0.001. Moreover, the error rate before feature selection is 0.1369, and after feature selection it is 0.014. Feature selection with the RFO technique improves the model’s performance by selecting the most relevant features and removing noisy and redundant information. This leads to improved generalization, reduced error rates, and improved statistical significance, as indicated by the decrease in p-value. Feature selection also improves computational efficiency substantially, enabling the model to make faster and better predictions. Overall, feature selection improves the model’s learning by improving predictive performance and accuracy. The ablation study is provided in Table 3.
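The RFO wrapper itself is not reproduced here; the sketch below shows only the general wrapper-selection idea it instantiates, where each candidate feature subset is scored by a fitness function and the best-scoring subset is kept (the feature names and the toy fitness are hypothetical, and an exhaustive loop stands in for the fox search):

```python
from itertools import combinations

# Candidate features, loosely modeled on the HARP attributes.
FEATURES = ["age", "gender", "sys_bp", "dia_bp", "ldl", "hdl", "body_temp"]

def fitness(subset):
    """Stand-in for validation accuracy of a model trained on `subset`:
    rewards informative features, penalizes subset size (toy score)."""
    informative = {"age", "sys_bp", "ldl"}
    return len(informative & set(subset)) - 0.1 * len(subset)

# Exhaustive search over all non-empty subsets; RFO would explore this
# space stochastically instead of enumerating it.
best = max(
    (c for r in range(1, len(FEATURES) + 1) for c in combinations(FEATURES, r)),
    key=fitness,
)
print(sorted(best))  # ['age', 'ldl', 'sys_bp']
```

The size penalty is what drives the behavior reported above: redundant features lower fitness, so the surviving subset is both smaller and more predictive.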
Performance analysis
The Python environment is used to verify the developed model’s efficacy. To analyze the model’s predictive abilities, it is evaluated with metrics such as Accuracy, F score, Precision, AUC, Recall, and error rate. To assess the proposed DRFBPS model’s performance, it is compared with existing approaches such as the ML Voting Classifier (MLVC)38, ML Stacking Classifier (MLSC)39, Light Gradient Boosting Classifier (LGBC)40, Extreme Gradient Boosting with Random Forest (EGBRF)40, CNN Sparse Autoencoder (CNNSA)41, and Linear SVM (LSVM)42.
Accuracy
Accuracy is a significant performance measure for predicting heart disease risk factors. It denotes the percentage of cases in which the model correctly predicts whether heart disease risk variables are present or absent. Accuracy is evaluated by Eq. (6)
$$Accuracy = \frac{CP + CA}{{CP + CA + IP + IA}}$$
(6)
here, \(CP\) denotes correctly predicted risk present, \(CA\) denotes correctly predicted risk absent, \(IP\) denotes incorrectly predicted risk present, and \(IA\) denotes incorrectly predicted risk absent. The accuracy is compared with the existing approaches in Fig. 11

The accuracy rate achieved by the existing MLVC is 80.1%, MLSC is 90.9%, LGBC is 77.84%, EGBRF is 75.63%, CNNSA is 83.56%, and LSVM is 86.43%. The developed DRFBPS achieved an accuracy of 98.6%; this higher accuracy rate shows the proposed model’s better performance.
Precision
Precision, also known as positive predictive value, is a measurement used to validate the accuracy of a predictive model, especially in classification tasks like recognizing heart disease risk factors. Precision measures the proportion of correct risk forecasts among all risk forecasts. It is computed by Eq. (7)
$$Precision = \frac{CP}{{CP + IP}}$$
(7)
The precision metric is assessed and compared with existing techniques in Fig. 12. High Precision indicates that the model is highly accurate in its positive predictions, implying that most of the occurrences it predicts as having the heart disease risk factor are correct.

The existing techniques MLVC, MLSC, LGBC, EGBRF, CNNSA, and LSVM attained a precision rate of 80.4%, 96.7%, 74.6%, 73.13%, 85.2%, and 87.5% respectively. The proposed DRFBPS model attained 94.7%, which performs better than the existing approaches.
Recall
Recall is an important metric to assess the model’s efficiency. It measures the proportion of correctly predicted risk instances among all actual risk instances, assessing the model’s capacity to detect every risk instance accurately. It is evaluated by Eq. (8)
$$Recall = \frac{CP}{{CP + IA}}$$
(8)
High Recall ensures that most patients with heart disease are appropriately recognized. It is assessed, and the abovementioned techniques are compared in Fig. 13.

The existing MLVC gained a recall rate of 80.1%, MLSC gained 87.6%, LGBC gained 73.26%, EGBRF gained 68.25%, CNNSA gained 82.9%, and LSVM gained 85.9%. The proposed DRFBPS model gained a recall rate of 97.9%. Compared to the other methods, DRFBPS achieved a considerably greater recall rate.
F score
The F score is a statistic that combines Precision and Recall. It assesses the Framework’s ability to correctly predict the risk factors while avoiding errors. It is given by Eq. (9)
$$F\;score = 2 \times \left[ {\frac{x \times y}{{x + y}}} \right]$$
(9)
here, \(x\) denotes the precision rate and \(y\) denotes the recall rate. The F score value for the DRFBPS model is assessed, and its comparison is shown in Fig. 14.

The F score for the existing MLVC is 80.1%, MLSC is 92.15%, LGBC is 73.93%, EGBRF is 70.61%, CNNSA is 84.05%, and LSVM is 86.7%, while the proposed technique attained a 97.7% F score. The higher F score demonstrates the model’s better predictive performance.
Error rate
The error rate represents the proportion of incorrect predictions to the total number of predictions. It quantifies the incorrect predictions made by the model and describes its overall performance. It is evaluated by Eq. (10)
$$Error\;rate = \frac{IP + IA}{{CP + CA + IP + IA}}$$
(10)
Figure 15 displays the error rate comparison. The error rate achieved by the existing MLVC is 0.199, MLSC is 0.091, LGBC is 0.2216, EGBRF is 0.2437, CNNSA is 0.1644, and LSVM is 0.1375. The designed DRFBPS model has an error rate of 0.014; since the developed technique obtains the lowest error rate, it performs best.
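All five metrics in Eqs. (6)–(10) derive from the same four counts \(CP\), \(CA\), \(IP\), and \(IA\); a minimal sketch with illustrative counts (not the paper’s results):

```python
def metrics(CP, CA, IP, IA):
    """Eqs. (6)-(10) in the paper's notation:
    CP = correctly predicted risk present, CA = correctly predicted risk absent,
    IP = incorrectly predicted risk present, IA = incorrectly predicted risk absent."""
    total = CP + CA + IP + IA
    accuracy = (CP + CA) / total          # Eq. (6)
    precision = CP / (CP + IP)            # Eq. (7)
    recall = CP / (CP + IA)               # Eq. (8)
    f_score = 2 * precision * recall / (precision + recall)  # Eq. (9)
    error = (IP + IA) / total             # Eq. (10)
    return accuracy, precision, recall, f_score, error

# Illustrative counts only (not the DRFBPS confusion matrix).
acc, prec, rec, f1, err = metrics(CP=90, CA=80, IP=10, IA=20)
print(round(acc, 2), round(prec, 2), round(rec, 3))  # 0.85 0.9 0.818
```

Note that accuracy and error rate are complements by construction, so they always sum to 1.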

AUC
The AUC is a performance indicator for classification models that evaluates the model’s capacity to differentiate between classes; a higher AUC value indicates more effective classification. Its comparison is shown in Fig. 16.

The AUC for MLVC, MLSC, LGBC, EGBRF, CNNSA, and LSVM is 88.4%, 96.1%, 72.27%, 74.71%, 90.3%, and 92.1%, respectively, and the proposed DRFBPS obtained 98.2%.
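AUC can be computed without tracing the full ROC curve via its Mann-Whitney formulation: it equals the probability that a randomly chosen high-risk case scores above a randomly chosen low-risk case. A stdlib-only sketch with hypothetical model scores:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive (high-risk) score
    exceeds a random negative (low-risk) score; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted risk scores (not the paper's outputs).
high_risk = [0.9, 0.8, 0.7, 0.6]
low_risk  = [0.5, 0.4, 0.65, 0.2]
print(auc(high_risk, low_risk))  # 0.9375
```

An AUC of 1.0 means every high-risk case outranks every low-risk case, while 0.5 is chance-level discrimination.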
P-value and confidence interval
The p-value and confidence interval (CI) are important statistical validation metrics for evaluating the robustness of predictive models. The p-value aids in identifying the significance of correlations between risk factors, while the CI gives an interval of values: a narrow CI indicates greater precision, and a wide CI indicates greater variability of the estimates. Together, these measures increase the statistical validity of heart attack risk prediction, so predictive models should be accurate and generalizable to larger populations. The results of the statistical validation, p-value and confidence intervals, are described in Table 4.
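A confidence interval for a reported accuracy can be sketched with the normal approximation for a proportion (here z = 1.96 gives a ~95% interval; other confidence levels only change z; the inputs below are illustrative, not taken from Table 4):

```python
from math import sqrt

def accuracy_ci(acc, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion:
    acc +/- z * sqrt(acc * (1 - acc) / n)."""
    se = sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

# Illustrative: 98.6% accuracy on a test set of 26,524 records.
lo, hi = accuracy_ci(0.986, 26524)
print(round(lo, 4), round(hi, 4))  # 0.9846 0.9874
```

The interval narrows as the test set grows, which is why large held-out sets lend more statistical weight to a reported accuracy.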
Moreover, the overall effectiveness of the designed Framework demonstrates a better predictive technique. The entire functionality of the proposed DRFBPS alongside current techniques is depicted in Table 4.
DRFBPS’s higher performance compared to existing models is due to its capability of capturing intricate, non-linear interactions in the data more efficiently. The model delivers a remarkable accuracy rate, which greatly surpasses existing approaches. This enhancement is seen across all the evaluation metrics, with DRFBPS having the lowest error percentage among all models, reflecting its low misclassification rate. Its high Precision and Recall also reflect its strength in true positive detection while keeping false positives and false negatives to a minimum, and its high F score verifies its balanced performance across Precision and Recall, further establishing its dependability. In addition, DRFBPS has the best AUC value, demonstrating the strongest discriminant ability in classification. The low p-value indicates a statistically significant relationship, while the CI gives an interval of values in which the actual effect size will probably fall, generally at a 97% confidence level. The wide gap in performance implies that DRFBPS has an optimized feature selection process that maximizes its predictability, generality, and resilience to intricacies in the data structure, making it the best-performing model among the compared methods.
Additionally, to justify the selection of RFO, other optimization algorithms such as the Genetic Algorithm (GA), Particle Swarm (PS), and Bayesian Optimization (BO) are hybridized with the DBN, and the results are shown in Table 5.
Table 5 shows that the proposed DBN + RFO model achieves the best accuracy, signifying that RFO is an excellent selection. It also possesses the lowest error rate and the highest AUC, indicating superior discrimination ability, while the lowest p-value indicates statistical significance and the high confidence interval supports its credibility. These findings demonstrate the efficiency of the proposed DBN + RFO method in enhancing prediction accuracy.