Abstract:
With the intensification of global warming, climate change now affects every aspect of the occurrence, transmission, and variation of infectious diseases, and the adverse effects of weather-related infectious diseases on human health have become a major public concern worldwide. Accurate and reliable forecasting of daily hand-foot-and-mouth disease (HFMD) cases is essential for implementing preventive and intervention measures promptly. To address the low accuracy and poor interpretability of existing HFMD incidence prediction models, we propose an interpretable prediction model, ARIMA–LSTM–XGBoost, which integrates multiple meteorological factors with the autoregressive integrated moving average (ARIMA) model, long short-term memory (LSTM), extreme gradient boosting (XGBoost), the grey wolf optimizer (GWO), the genetic algorithm (GA), and Shapley additive explanations (SHAP). The model accounts for the potential impact of meteorological factors on HFMD incidence and aims to predict incidence trends precisely and to analyze the key underlying influencing factors through multi-dimensional, multi-layered algorithm integration. First, the ARIMA model is used to analyze historical HFMD incidence data and capture linear trends. Through differencing, autoregression, and moving-average operations, it extracts structural features from the time series and generates initial predictions together with a residual sequence. This residual sequence contains complex information that the ARIMA model fails to capture and provides the foundation for the subsequent nonlinear analysis. Second, LSTM is applied to the ARIMA residuals to capture potential complex nonlinear relationships and long-term dependencies. LSTM networks are particularly well suited to modeling long-term dependencies in time-series data.
To further enhance LSTM performance, the GWO is employed to adaptively optimize its key parameters. Third, to leverage the strengths of XGBoost in handling nonlinear relationships and high-dimensional data while mitigating its complex parameter tuning and slow convergence, the GA is used to optimize the XGBoost parameters. By simulating the selection, crossover, and mutation mechanisms of biological evolution, the GA efficiently searches the parameter space for optimal solutions. Finally, the predictions of ARIMA–LSTM and XGBoost are fused using a reciprocal error weighting method to improve overall prediction accuracy, and the SHAP method is used to analyze feature importance. SHAP provides a fair and consistent way to assess the contribution of each feature to the model's predictions: it identifies which factors are most critical for HFMD incidence prediction and quantifies their degree of influence, thereby enhancing the interpretability of the model. Comparative experiments on daily HFMD incidence and meteorological monitoring data from a southern city between 2014 and 2019 were conducted to evaluate the proposed model. The results show that the ARIMA–LSTM–XGBoost model achieves significantly higher prediction accuracy than the other machine learning prediction models compared. The model not only predicts HFMD incidence trends accurately but also identifies the key meteorological factors influencing incidence, providing a scientific basis for public health decision-making.
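
The reciprocal error weighting used in the fusion step can be illustrated with a minimal sketch. It assumes that each model's weight is proportional to the inverse of a single positive validation error (for example, the mean absolute error) and that the weights are normalized to sum to one; the abstract does not specify the error metric or the exact normalization, so both are assumptions here. The function names `reciprocal_error_weights` and `fuse_predictions` and all numbers are hypothetical and serve only as illustration.

```python
def reciprocal_error_weights(errors):
    """Return weights proportional to 1/error, normalized to sum to 1.

    `errors` holds one strictly positive validation error per model
    (assumed here to be something like MAE; the metric is an assumption).
    """
    if any(e <= 0 for e in errors):
        raise ValueError("errors must be strictly positive")
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuse_predictions(predictions, errors):
    """Combine per-model forecasts using reciprocal-error weights.

    `predictions` is a list of equal-length forecast lists, one per model,
    in the same order as `errors`.
    """
    horizon = len(predictions[0])
    if any(len(p) != horizon for p in predictions):
        raise ValueError("all forecasts must have the same length")
    weights = reciprocal_error_weights(errors)
    return [
        sum(w * p[t] for w, p in zip(weights, predictions))
        for t in range(horizon)
    ]


# Hypothetical forecasts of daily HFMD cases (not data from the study).
arima_lstm_forecast = [12.0, 15.0, 9.0]
xgboost_forecast = [10.0, 17.0, 11.0]
# Hypothetical validation errors: the first model is three times as accurate.
validation_errors = [2.0, 6.0]

weights = reciprocal_error_weights(validation_errors)
fused = fuse_predictions([arima_lstm_forecast, xgboost_forecast],
                         validation_errors)
```

With these illustrative errors, the more accurate model receives roughly three quarters of the total weight, so the fused forecast stays closer to its predictions while still drawing on the second model.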