Imbalanced data ensemble classification based on cluster-based under-sampling algorithm

  • Abstract: Most traditional classification algorithms assume that the data set is balanced and aim to maximize overall classification accuracy. Real-world data sets, however, are often imbalanced, so these algorithms tend to produce a high classification error rate on minority class samples. Existing methods for improving classification on imbalanced data fall into two main groups. The first works at the data level: over-sampling increases the number of minority class samples, or under-sampling reduces the number of majority class samples. The second works at the algorithm level and modifies the classifier itself. This paper combines an existing cluster-based under-sampling method with ensemble learning to classify imbalanced data. First, in the data processing stage, cluster-based under-sampling is used to form a balanced data set. The AdaBoost ensemble algorithm is then trained on this new data set. During the ensemble process, weights are introduced to distinguish the contributions of minority class and majority class data to the ensemble error rate, so that the algorithm pays more attention to the minority class and classifies minority class data more accurately.
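The two-stage procedure summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering method, the weak learner, or exactly how the class weights enter the error rate, so everything below is an assumption made for illustration: k-means clustering with one representative kept per cluster, threshold stumps as weak learners, and a hypothetical `minority_weight` factor that scales the contribution of minority class samples (label +1) to the weighted error. It is one plausible reading of the idea, not the paper's algorithm.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            groups[dists.index(min(dists))].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids


def cluster_undersample(majority, n_keep, seed=0):
    """Assumed scheme: for each of n_keep clusters, keep the real majority
    sample closest to the centroid (duplicates are possible)."""
    kept = []
    for c in kmeans(majority, n_keep, seed=seed):
        dists = [math.dist(p, c) for p in majority]
        kept.append(majority[dists.index(min(dists))])
    return kept


def stump_predict(stump, x):
    feature, threshold, sign = stump
    return sign if x[feature] >= threshold else -sign


def best_stump(X, y, w):
    """Exhaustive search for the threshold stump with lowest weighted error."""
    best_err, best = None, None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for s in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if stump_predict((f, t, s), x) != yi)
                if best_err is None or err < best_err:
                    best_err, best = err, (f, t, s)
    return best


def weighted_adaboost(X, y, rounds=10, minority_weight=2.0):
    """AdaBoost variant (assumption): a mistake on a minority sample (+1)
    counts minority_weight times as much in the weighted error."""
    n = len(X)
    cost = [minority_weight if yi == 1 else 1.0 for yi in y]
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        cw = [wi * ci for wi, ci in zip(w, cost)]
        total = sum(cw)
        cw = [v / total for v in cw]
        stump = best_stump(X, y, cw)
        err = sum(v for x, yi, v in zip(X, y, cw)
                  if stump_predict(stump, x) != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Standard AdaBoost re-weighting of the samples.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, x))
             for x, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model


def predict(model, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score >= 0 else -1


def fit_balanced_ensemble(majority, minority, rounds=10, minority_weight=2.0):
    """Stage 1: under-sample the majority class to the minority size.
    Stage 2: train the class-weighted AdaBoost on the balanced set."""
    kept = cluster_undersample(majority, len(minority))
    X = kept + list(minority)
    y = [-1] * len(kept) + [1] * len(minority)
    return weighted_adaboost(X, y, rounds, minority_weight)
```

Calling `fit_balanced_ensemble(majority, minority)` with two lists of feature tuples returns a list of `(alpha, stump)` pairs that `predict` can score. Setting `minority_weight` above 1 makes minority class mistakes dominate the error used to choose each stump and its `alpha`, which is the effect the abstract attributes to its weighting scheme.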

     
