Abstract:
Hydrogen embrittlement (HE) poses a serious threat to the structural integrity and service reliability of low-alloy steels and is one of the key factors limiting their long-term performance in critical engineering applications. The diffusion, trapping, and accumulation of hydrogen atoms within the material embrittle the microstructure and promote crack initiation, significantly reducing ductility and fracture toughness. HE behavior is governed by the coupled effects of microstructure and environmental parameters such as temperature, pressure, and hydrogen concentration, and is therefore highly nonlinear and complex. HE experiments, however, are often costly, time-consuming, and limited in reproducibility, so the available datasets are small. Data scarcity and uneven feature distributions are prevalent, making it difficult for existing machine learning models to predict accurately from so few samples. In recent years, data augmentation techniques have increasingly been introduced into materials science to mitigate this scarcity. These methods expand a dataset by statistically perturbing and synthesizing samples from the original data while preserving the consistency of the data distribution, and they have shown promising results in alloy design, fatigue life prediction, and corrosion modeling. To address HE in low-alloy steels, this study proposes a Quantile Gaussian Data Augmentation with Multi-model Learning (QGDAM) method for predicting HE behavior from small samples, aiming to achieve robust learning and high-precision prediction under limited data conditions. The method comprises two modules: data augmentation and regression prediction. In the data augmentation phase, a Quantile Transformation Module is introduced to mitigate skewed feature distributions.
Three data augmentation strategies based on Gaussian Mixture Models (GMM) are designed to generate augmented samples that closely match the distribution of the original data. In the prediction phase, a multi-model ensemble regression framework is established, incorporating Random Forest (RF), Gradient Boosting (GB), Light Gradient Boosting Machine (LightGBM), and K-Nearest Neighbors (KNN). A low-alloy steel HE behavior database containing 17 key features was constructed from 90 valid samples extracted from publicly available HE experimental data. These features cover material strength parameters, environmental conditions, and elemental composition. The results show that QGDAM significantly outperforms traditional machine learning approaches in both the quality of the augmented samples and prediction accuracy. Compared with existing augmentation methods, QGDAM produces higher-quality samples across the four augmentation ratios tested. Compared with the baseline model, QGDAM achieves a significant reduction in mean squared error (MSE) and an average increase of 0.18 in the coefficient of determination (R²). It also significantly improves prediction accuracy on two external validation sets, indicating strong generalization capability and robustness. Furthermore, the feature-response relationships of the models before and after augmentation are compared using Partial Dependence Plots (PDP) and SHapley Additive exPlanations (SHAP). The augmented model captures more accurately how changes in the feature variables influence HE sensitivity. In contrast, the dependence curves obtained from the original dataset are scattered and show weaker regularity, indicating that QGDAM significantly enhances the model's ability to capture the underlying physical mechanisms at the feature learning level. Overall, the results demonstrate that the proposed QGDAM method effectively improves the accuracy and interpretability of HE behavior prediction under small-sample conditions.
The method thus offers a generalizable, data-driven approach to intelligent modeling of the service performance of complex materials.
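The augmentation idea summarized in the abstract (quantile transformation followed by sampling from a Gaussian mixture) can be illustrated with a minimal sketch. It is not the QGDAM implementation: it handles a single feature, uses only the Python standard library, fits a two-component one-dimensional mixture by plain EM, and runs on made-up log-normal data. All function names and parameter values here are hypothetical, and the three GMM strategies and the multi-model regression stage of the paper are not reproduced.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def to_normal_scores(values):
    """Quantile-transform values to standard-normal scores via empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        scores[idx] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores

def from_normal_scores(z, sorted_values):
    """Inverse transform: interpolate the empirical quantiles at score z."""
    n = len(sorted_values)
    pos = min(max(STD_NORMAL.cdf(z) * n - 0.5, 0.0), n - 1.0)  # fractional rank
    lo = int(math.floor(pos))
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def fit_gmm_1d(data, k=2, iters=100, seed=0):
    """Fit a 1-D Gaussian mixture with k components by expectation-maximization."""
    rng = random.Random(seed)
    n = len(data)
    mean_all = sum(data) / n
    var_all = max(sum((x - mean_all) ** 2 for x in data) / n, 1e-6)
    weights = [1.0 / k] * k
    means = rng.sample(data, k)       # initialize means at random data points
    variances = [var_all] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            dens = [w * NormalDist(m, math.sqrt(v)).pdf(x)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue              # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)   # floor to avoid collapse
    return weights, means, variances

def sample_gmm_1d(params, n_new, seed=0):
    """Draw n_new samples from the fitted mixture."""
    weights, means, variances = params
    rng = random.Random(seed)
    samples = []
    for _ in range(n_new):
        j = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append(rng.gauss(means[j], math.sqrt(variances[j])))
    return samples

def augment_feature(values, n_new, k=2, seed=0):
    """Transform, fit the mixture, sample, and map back to the original scale."""
    scores = to_normal_scores(values)
    params = fit_gmm_1d(scores, k=k, seed=seed)
    ref = sorted(values)
    return [from_normal_scores(z, ref) for z in sample_gmm_1d(params, n_new, seed=seed)]

# Hypothetical skewed feature with 90 samples, augmented by 50 %.
data_rng = random.Random(42)
original = [data_rng.lognormvariate(0.0, 1.0) for _ in range(90)]
synthetic = augment_feature(original, n_new=45)
```

Because the inverse transform interpolates between observed quantiles, the synthetic values in this sketch stay within the range of the original feature. A realistic pipeline would instead model all features and the target jointly, for example with a multivariate mixture, so that correlations between them are preserved.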