Prediction and analysis of COVID-19 daily new cases and cumulative cases: times series forecasting and machine learning models | BMC Infectious Diseases
Data collection
The data for this study were obtained from the official WHO website, and MS Excel 2019 was used to build a COVID-19 time-series database. At least 30 observations are required to build a stable and effective ARIMA model [32]. Cumulative cases and daily confirmed cases for three countries (the USA, India, and Brazil) from May 1, 2020, through November 30, 2021, were selected as training data for constructing the disease prediction models. The fitted models were then used to forecast the cumulative cases and daily confirmed cases of those three countries over the next 30 days (December 1, 2021, to December 30, 2021). A statistical description of the raw data is presented in Table 1. Prediction performance was assessed on the confirmed case data for this forecast month (December 1, 2021–December 30, 2021), with 95% confidence intervals.
SARIMA and ARIMA model
ARIMA, also known as the Box-Jenkins model, is a method for the analysis and forecasting of time series data, first proposed by Box and Jenkins in the 1970s [32]. ARIMA (p, d, q) stands for the autoregressive integrated moving average model, where p is the autoregressive order, d is the degree of differencing, and q is the moving-average order. When the raw data show seasonal features, the SARIMA (seasonal autoregressive integrated moving average) model, an extension of ARIMA, is often used for time series forecasting. These models apply a mathematical model to a non-stationary time series after the data have been made stationary, and then estimate and extrapolate the future state of the series from the pattern in its historical data [33]. The cumulative number of confirmed cases and the daily new cases of COVID-19 form random series with nonlinear or seasonal character, so these models can be considered suitable for forecasting. ARIMA modeling includes the following steps [34]: Step 1: assessment (identification) of the model; Step 2: estimation of the model parameters; Step 3: checking of the model hypotheses (validation); Step 4: prediction with the model. The structure of the ARIMA (p, d, q) model is given in Eq. (1).
$$y_t=\phi_1 y_{t-1}+\phi_2 y_{t-2}+\dots+\phi_p y_{t-p}+\varepsilon_t-\theta_1\varepsilon_{t-1}-\theta_2\varepsilon_{t-2}-\dots-\theta_q\varepsilon_{t-q}$$
(1)
In Eq. (1), \(\phi_a\) (a = 1, 2, …, p) and \(\theta_b\) (b = 1, 2, …, q) are the parameters of the model, and \(y_t\) and \(\varepsilon_t\) represent the observed value and the random error at time step t. The random errors \(\varepsilon_t\) are assumed to have zero mean and constant variance \(\sigma^2\). Setting q = 0 in Eq. (1) gives an AR model of order p, and setting p = 0 gives an MA model of order q. Thus p and q are both important factors in determining the ARIMA model.
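As a minimal sketch of how Eq. (1) produces a value, the Python function below evaluates the autoregressive and moving-average sums for one time step, with the current error set to zero (its assumed mean). This is an illustration only, not the R code used in the study: the function name `arma_one_step`, the coefficients, and the data are hypothetical, and for d > 0 the same recursion would apply to the differenced series.

```python
def arma_one_step(history, errors, phi, theta):
    """One-step value from Eq. (1), with the current error set to 0.

    history: past observations, most recent last (history[-1] is y_{t-1})
    errors:  past errors, most recent last (errors[-1] is the error at t-1)
    phi:     AR coefficients [phi_1, ..., phi_p]
    theta:   MA coefficients [theta_1, ..., theta_q]
    """
    if len(history) < len(phi) or len(errors) < len(theta):
        raise ValueError("not enough past values for the requested orders")
    ar_part = sum(c * history[-k] for k, c in enumerate(phi, start=1))
    ma_part = sum(c * errors[-k] for k, c in enumerate(theta, start=1))
    # Eq. (1) subtracts the moving-average terms.
    return ar_part - ma_part


# Hypothetical coefficients and data, chosen only for illustration.
y_past = [10.0, 12.0]   # y_{t-2}, y_{t-1}
e_past = [0.5, -1.0]    # errors at t-2 and t-1
forecast = arma_one_step(y_past, e_past, phi=[0.5, 0.25], theta=[0.4])
print(forecast)
```

Passing an empty `theta` reduces the function to a pure AR(p) prediction, and an empty `phi` reduces it to a pure MA(q) prediction, mirroring the q = 0 and p = 0 cases of Eq. (1).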
The Prophet model
Prophet is a powerful and fast open-source time series model developed by Facebook. It handles the impact of missing values and outliers in the time series well and is suitable for predictive analysis of the COVID-19 epidemic [35,36,37]. The model decomposes a time series into trend, periodic, and holiday terms, which are combined in the following equation.
$$Y\left(t\right)=g\left(t\right)+s\left(t\right)+h\left(t\right)+\varepsilon _t$$
(2)
where \(Y\left(t\right)\) is the value of the time series at time t; \(g\left(t\right)\) is the trend term, the portion of the time series that shows a non-periodic trend of change; \(s\left(t\right)\) is the periodic term, the portion of the time series that changes periodically; \(h\left(t\right)\) is the holiday term, the portion of the series affected by holidays, which was not considered here because holidays had no effect on the trend projections for the data in this study; and \(\varepsilon _t\) is the error term, which accounts for any unusual changes not accommodated by the model.
Prophet uses a Fourier series to model seasonal effects, with the seasonality specified as a periodic function of t [38, 39]. An arbitrary smooth seasonal effect in the scaled time variable t is represented by the Fourier series:
$$s\left(t\right)=\sum_{n=1}^{\infty}\left(a_n\cos\frac{2n\pi t}{P}+b_n\sin\frac{2n\pi t}{P}\right)$$
(3)
where P is the period. In practice the series is truncated at a given value of N, so fitting the seasonality model requires estimating the 2N parameters \(a_1, a_2, \dots, a_N\) and \(b_1, b_2, \dots, b_N\).
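The truncated form of Eq. (3) can be evaluated directly once the coefficients are known. The Python sketch below does this for a hypothetical weekly period (P = 7) with N = 2 harmonics. The function name `fourier_seasonality` and the coefficient values are assumptions made for illustration and are not fitted values from this study.

```python
import math


def fourier_seasonality(t, a, b, period):
    """Evaluate s(t) from Eq. (3), truncated to N = len(a) terms."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    total = 0.0
    for n, (a_n, b_n) in enumerate(zip(a, b), start=1):
        angle = 2.0 * math.pi * n * t / period
        total += a_n * math.cos(angle) + b_n * math.sin(angle)
    return total


# Hypothetical coefficients for a weekly pattern (P = 7 days, N = 2).
a = [1.0, 0.5]
b = [0.3, 0.0]
weekly = [fourier_seasonality(day, a, b, period=7.0) for day in range(14)]
print(weekly[:7])
```

Because every term has a period that divides P, the values repeat every seven days, so `weekly[k]` and `weekly[k + 7]` agree up to floating-point error.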
Analytical tools and model evaluation
ACF and PACF test
The autocorrelation function (ACF) gives the autocorrelation of a series at each lag value. In brief, it describes the degree of correlation between the current value of the series and its past values. The partial autocorrelation function (PACF), rather than correlating the current value with each lagged value directly as the ACF does, gives the correlation between the current value and a lagged value after the effect of the intermediate lags has been removed. The ACF shows the linear relationship between the observation at time t and the earlier observation at time t − n. The ACF and PACF for a given time series X can be defined as:
$$\mathrm{ACF}\left(X_t,X_{t-n}\right)=\frac{\mathrm{Covariance}\left(X_t,X_{t-n}\right)}{\mathrm{Variance}\left(X_t\right)}$$
(4)
$$\mathrm{PACF}\left(X_t,X_{t-2}\right)=\frac{\mathrm{Covariance}\left(X_t,X_{t-2}\mid X_{t-1}\right)}{\sqrt{\mathrm{Variance}\left(X_t\mid X_{t-1}\right)}\sqrt{\mathrm{Variance}\left(X_{t-2}\mid X_{t-1}\right)}}$$
(5)
where, in the ACF, n is the lag (the difference in time between \(X_t\) and \(X_{t-n}\)); the PACF is written for the observed values \(X_t\) and \(X_{t-2}\), that is, for n = 2.
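As a rough illustration of Eqs. (4) and (5), the Python sketch below computes the usual sample ACF of a short series and then derives the lag-2 partial autocorrelation from the lag-1 and lag-2 autocorrelations through the standard identity (r2 − r1²)/(1 − r1²). The function names and the toy series are hypothetical, and the sketch assumes a non-constant series so that the variance is not zero.

```python
def sample_acf(x, lag):
    """Sample autocorrelation of x at the given lag, following Eq. (4)."""
    n = len(x)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    mean = sum(x) / n
    # The common 1/n factor of covariance and variance cancels in the ratio.
    variance = sum((v - mean) ** 2 for v in x)
    covariance = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return covariance / variance


def pacf_lag2(x):
    """Lag-2 partial autocorrelation, computed from the lag-1 and lag-2 ACF."""
    r1 = sample_acf(x, 1)
    r2 = sample_acf(x, 2)
    return (r2 - r1 ** 2) / (1.0 - r1 ** 2)


# Toy series, used only to show the calls.
series = [1.0, 2.0, 3.0, 4.0, 5.0]
acf_values = [sample_acf(series, lag) for lag in range(3)]
pacf_2 = pacf_lag2(series)
print(acf_values, pacf_2)
```

In the study itself these quantities would be read from ACF and PACF plots over many lags; the sketch only shows what a single point on each plot represents.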
Performance indices
Three indices were used to assess model fitting and forecasting performance: root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Lower RMSE, MAE, and MAPE values indicate a better fit to the data. These criteria are given in Eqs. (6)–(8), respectively [40]. A logarithmic transformation may be needed in addition to differencing to make the time series stationary; this approach takes the log of each point and then differences the result. The Bayesian information criterion (BIC), given in Eq. (9), is an information criterion that measures the goodness of fit of a statistical model. It builds on the concept of entropy and weighs the complexity of the estimated model against its goodness of fit to the data, which helps in assessing the model’s parameters and how well the model performs. In this study, to avoid the excessive model complexity that comes with pursuing excessive accuracy, the model with the lower BIC value was selected.
$$\mathrm{RMSE}=\sqrt{\frac{SSE}{n}}=\sqrt{\frac{\sum_{i=1}^{n}\left(Y_i-\overline{Y}_i\right)^2}{n}}$$
(6)
$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|Y_i-\overline{Y}_i\right|$$
(7)
$$\mathrm{MAPE}=\frac{100}{n}\times\sum_{i=1}^{n}\left|\frac{Y_i-\overline{Y}_i}{Y_i}\right|$$
(8)
$$\mathrm{BIC}=-2\log L\left(\widehat{\theta}\right)+n\log N$$
(9)
In Eqs. (6), (7), and (8), \(Y_i\) is the actual observed value, \(\overline{Y}_i\) is the model’s prediction, i = 1, …, n, and n is the number of observations. In Eq. (9), \(L\left(\widehat{\theta}\right)\) is the maximized value of the likelihood function, N is the number of observations, and n is the number of model parameters.
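The four criteria can be written as short functions. The Python sketch below is one plain reading of Eqs. (6)–(9), with MAE taken over absolute errors as its name implies. The function names and the example numbers are hypothetical, and MAPE is undefined whenever an actual value is zero.

```python
import math


def rmse(actual, predicted):
    """Root mean square error, Eq. (6)."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error, Eq. (7)."""
    n = len(actual)
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / n


def mape(actual, predicted):
    """Mean absolute percentage error, Eq. (8); actual values must be nonzero."""
    n = len(actual)
    return 100.0 / n * sum(abs((y - p) / y) for y, p in zip(actual, predicted))


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion, Eq. (9)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Made-up values, used only to show the calls.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
print(bic(log_likelihood=-50.0, n_params=3, n_obs=100))
```

When two candidate models are compared on the same data, the one with the smaller BIC is preferred, since the second term penalizes each additional parameter.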
Data analysis
Because the daily new confirmed cases of COVID-19 show periodic (seasonal) characteristics, the SARIMA and Prophet models were used to predict daily new confirmed cases for the next 30 days, and the ARIMA and Prophet models were constructed to predict cumulative confirmed cases. All three models were fitted and used for forecasting in R 4.1.1 with the forecast and prophet packages. Before the prediction models were applied, the original data were log-transformed to make the time series more stable and to weaken the collinearity of the model, thereby improving prediction accuracy. Because of the periodicity of daily new cases, their seasonal components were eliminated.
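The logarithmic conversion followed by differencing can be sketched as below. This is an illustrative Python version, not the R code used in the study: the function names are hypothetical, and the sketch assumes strictly positive counts (a series containing zeros, such as days with no new cases, would need an adjustment such as adding 1 before taking logs).

```python
import math


def log_difference(series):
    """Take the natural log of each point, then difference consecutive values."""
    logs = [math.log(v) for v in series]
    return [logs[i] - logs[i - 1] for i in range(1, len(logs))]


def invert_log_difference(first_value, diffs):
    """Rebuild the original scale from the first value and the log differences."""
    values = [first_value]
    log_level = math.log(first_value)
    for d in diffs:
        log_level += d
        values.append(math.exp(log_level))
    return values


# Made-up cumulative counts growing by 10% per step.
cases = [100.0, 110.0, 121.0]
diffs = log_difference(cases)
restored = invert_log_difference(cases[0], diffs)
print(diffs, restored)
```

Forecasts made on the transformed scale have to be mapped back in the same way before they are compared with observed case counts.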