Interpretable prediction model for hand-foot-and-mouth disease incidence based on improved LSTM and XGBoost

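  • Summary: To address the low accuracy and poor interpretability of existing hand-foot-and-mouth disease (HFMD) incidence prediction models, this paper combines multiple meteorological factors and proposes an interpretable prediction model, ARIMA–LSTM–XGBoost, built on the autoregressive integrated moving average (ARIMA) model, long short-term memory (LSTM), extreme gradient boosting (XGBoost), the grey wolf optimizer (GWO), a genetic algorithm (GA), and Shapley additive explanations (SHAP). First, an ARIMA model captures the linear trend in the data, produces a forecast, and yields a residual series. Second, the residuals are fed into an LSTM network whose key parameters are adaptively tuned by GWO, so that it captures complex nonlinear relationships and long-term dependencies. Third, the global search capability of GA is used to optimize the XGBoost parameters, compensating for XGBoost's slow convergence. Finally, the improved ARIMA–LSTM and XGBoost models are fused by the reciprocal-error method to raise prediction accuracy, and SHAP is used to attribute feature importance and analyze the model's interpretability. Comparative experiments on daily HFMD case counts and meteorological monitoring data from a southern city for 2014 to 2019 show that ARIMA–LSTM–XGBoost achieves higher prediction accuracy than the other machine learning prediction models, predicts HFMD case counts accurately, and efficiently identifies latent features associated with HFMD.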
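The reciprocal-error fusion step can be illustrated with a short sketch. It assumes that each branch (ARIMA–LSTM and XGBoost) has a strictly positive validation error and that the weights are simply the normalized reciprocals of those errors; the choice of mean absolute error as the error measure, the function names, and all numbers below are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def reciprocal_error_weights(errors):
    """Return weights proportional to 1/error, normalized to sum to 1."""
    errors = np.asarray(errors, dtype=float)
    if np.any(errors <= 0):
        raise ValueError("errors must be strictly positive")
    inverse = 1.0 / errors
    return inverse / inverse.sum()


def fuse(predictions, errors):
    """Weighted average of model predictions (one row per model)."""
    weights = reciprocal_error_weights(errors)
    return weights @ np.asarray(predictions, dtype=float)


# Hypothetical validation MAEs for the two branches (illustrative only).
mae_arima_lstm, mae_xgboost = 2.0, 6.0
weights = reciprocal_error_weights([mae_arima_lstm, mae_xgboost])

# Hypothetical daily case-count forecasts from each branch.
pred_arima_lstm = np.array([10.0, 12.0, 15.0])
pred_xgboost = np.array([14.0, 8.0, 19.0])
fused = fuse([pred_arima_lstm, pred_xgboost], [mae_arima_lstm, mae_xgboost])

print(weights)  # weight of each branch in the fused forecast
print(fused)    # fused forecast for the three days
```

Under this scheme the branch with the smaller validation error receives the larger weight, so a branch that is three times more accurate contributes three times as much to the fused forecast.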

     

    Abstract: With the intensification of global warming, climate change has affected every aspect of the occurrence, transmission, and variation of infectious diseases. The adverse effects of weather-related infectious diseases on human health have become a major public concern worldwide. Accurate and reliable forecasting of daily hand-foot-and-mouth disease (HFMD) cases is essential for implementing timely preventive and intervention measures. To address the low accuracy and poor interpretability of existing HFMD incidence prediction models, we propose an interpretable prediction model, ARIMA–LSTM–XGBoost, which integrates multiple meteorological factors with the autoregressive integrated moving average (ARIMA) model, long short-term memory (LSTM), extreme gradient boosting (XGBoost), the grey wolf optimizer (GWO), a genetic algorithm (GA), and Shapley additive explanations (SHAP). The model takes into account the potential impact of meteorological factors on HFMD incidence and aims to predict HFMD incidence trends precisely and to analyze the key underlying influencing factors by integrating algorithms at several levels. First, the ARIMA model is used to analyze historical HFMD incidence data and capture linear trends. Through differencing, autoregression, and moving average operations, the ARIMA model extracts structural features from the time series and generates initial predictions along with a residual sequence. The residual sequence contains the complex information that the ARIMA model fails to capture and provides the basis for the subsequent nonlinear analysis. Second, LSTM is applied to the residuals left by the ARIMA model to capture complex nonlinear relationships and long-term dependencies, a task for which LSTM networks are particularly well suited in time-series data. To further improve LSTM performance, GWO is employed to adaptively optimize the key parameters of the LSTM. Third, to exploit the strengths of XGBoost in handling nonlinear relationships and high-dimensional data while overcoming its complex parameter tuning and slower convergence, GA is used to optimize the parameters of XGBoost. By simulating the selection, crossover, and mutation mechanisms of biological evolution, GA searches the parameter space efficiently for good solutions and thereby improves the performance of XGBoost. Finally, the predictions of ARIMA–LSTM are fused with those of XGBoost using a reciprocal error weighting method to improve overall prediction accuracy. In addition, the SHAP method is used to analyze feature importance. SHAP provides a fair and consistent way to assess the contribution of each feature to the model's predictions: it shows which factors are most critical for HFMD incidence prediction and quantifies the degree of their influence, thereby enhancing the interpretability of the model. Comparative experiments on daily HFMD incidence and meteorological monitoring data from a southern city between 2014 and 2019 were conducted to evaluate the proposed model. The results show that the ARIMA–LSTM–XGBoost model achieves higher prediction accuracy than the other machine learning prediction models compared. The model not only predicts HFMD incidence trends accurately but also identifies key meteorological factors influencing the incidence, providing a scientific basis for public health decision-making.
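The idea behind SHAP attribution, assigning each feature a share of the difference between a prediction and a baseline prediction, can be made concrete with an exact Shapley value computation on a toy model. This is only a sketch under stated assumptions: the three feature names, the toy model, and the all-zero baseline are hypothetical, and the brute-force enumeration below is not the estimator used by the `shap` library or by the paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; absent features are set to their baseline value."""
    n = len(x)

    def value(subset):
        # Evaluate the model with features in `subset` taken from x
        # and all other features taken from the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical response with one interaction term (illustrative only).
    temperature, humidity, rainfall = z
    return 2.0 * temperature + humidity + 0.5 * temperature * humidity - rainfall


x = [3.0, 2.0, 1.0]         # hypothetical feature values for one day
baseline = [0.0, 0.0, 0.0]  # hypothetical reference point
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split equally between the two features involved, which is the sense in which the attribution is fair and consistent.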

     
