Active regression learning method for material data

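  • Summary: Differences in production environments and measurement conditions make the material data used for machine learning noisy. Labeling material data requires domain knowledge and specialist skills, so labeling costs are also relatively high. Together, these two factors pose a major challenge to applying machine learning in the materials field. To address this challenge, an active regression learning method is proposed that consists of an outlier detection module, a greedy sampling module, and a minimum change sampling module. Compared with other active learning methods, this method integrates an outlier detection mechanism, so it selects high-quality samples while effectively excluding the influence of noisy data and avoiding sunk costs. Comparative experiments against the latest active regression learning methods were carried out on public and non-public datasets. The results show that, with the same amount of data, the performance metrics of the task model trained with the proposed method improve by 15% on average compared with other models, and that only 30%–40% of the data is needed as a training set to match or even exceed the accuracy of a task model trained on all the data.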

     

    Abstract: To date, artificial intelligence has been successfully applied in various fields of materials science, but these applications require a large amount of high-quality data. In practice, unlabeled data are plentiful while labeled data are scarce, because annotating data requires delicate and expensive experiments whose cost in time and money cannot be ignored. Active learning selects a few high-quality samples from a large pool of unlabeled data for labeling, so that task model performance is optimized at the lowest possible labeling cost. However, active learning methods suited to material property regression are poorly understood, and general active learning methods cannot easily avoid the negative effects of noisy data, which results in wasted labeling cost. Therefore, we propose a new active regression learning method with three components: (1) outlier detection module: the task model is trained on the labeled dataset, and its predictions on the labeled data, together with the labeled dataset itself, are used to train an auxiliary classification model that classifies outliers; the samples in the unlabeled dataset that are most likely to be outliers are then excluded; (2) greedy sampling: an iterative procedure selects the data point that is farthest, in the geometric space, from both the labeled dataset and the points already selected, so that sample diversity is fully considered; and (3) minimum change sampling: selecting the unlabeled data that show the minimum change before and after the task model is trained on the labeled dataset, since such data are relatively underrepresented in the feature space of the labeled dataset. We performed experiments on the concrete slump test dataset and the negative coefficient of thermal expansion dataset and compared our method with the latest active regression learning methods.
The results show that, on noisy datasets, other methods do not necessarily improve task model performance after labeling data in each active learning cycle, and their final performance cannot reach that of the task model trained on all the data. With the same amount of data, the performance metrics of the task model trained with our method improve by 15% on average compared with other models. Because of the added outlier detection mechanism, our method effectively avoids sampling outliers while selecting high-quality samples. The task model trained on only 30%–40% of the data can match or even exceed the accuracy of the task model trained on all the data.
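
The greedy sampling step described in the abstract can be illustrated with a minimal sketch. It assumes that "farthest in the geometric space" means Euclidean distance in feature space and that each candidate is scored by its distance to the nearest labeled or already-selected point; the function name `greedy_sample` and its parameters are hypothetical and are not taken from the paper.

```python
import math


def greedy_sample(labeled, pool, k):
    """Return indices of up to k pool points chosen farthest-first.

    Assumption (not from the paper): Euclidean distance, and each candidate
    is scored by the distance to its nearest labeled or already-chosen point.
    """
    if not labeled:
        raise ValueError("need at least one labeled point")
    # Distance from every pool point to its nearest labeled point.
    min_dist = [min(math.dist(p, q) for q in labeled) for p in pool]
    chosen = []
    for _ in range(min(k, len(pool))):
        # Take the candidate whose nearest reference point is farthest away.
        best = max((i for i in range(len(pool)) if i not in chosen),
                   key=lambda i: min_dist[i])
        chosen.append(best)
        # The new pick becomes a reference point, so shrink the distances.
        for i, p in enumerate(pool):
            min_dist[i] = min(min_dist[i], math.dist(p, pool[best]))
    return chosen


labeled = [(0.0, 0.0)]
pool = [(1.0, 0.0), (5.0, 0.0), (0.5, 0.0), (-4.0, 0.0)]
print(greedy_sample(labeled, pool, 2))  # prints [1, 3]
```

In this toy example the point at (5, 0) is picked first because it is farthest from the single labeled point, and (-4, 0) is picked second because it is farthest from both the labeled point and the first pick. The outlier detection and minimum change sampling modules are not covered by this sketch.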

     
