Abstract:
Clustering is an important task in data mining. Most clustering algorithms handle balanced datasets effectively but perform poorly on imbalanced ones. For example, K–means, a classical partition clustering algorithm, tends to produce a “uniform effect” on imbalanced datasets: it often yields clusters of relatively uniform size, with the small clusters “swallowing” part of the data objects that belong to the large clusters. As a result, the number and density of data objects in different clusters tend to be the same. To address the “uniform effect” that the classical K–means algorithm produces on imbalanced data, a clustering algorithm based on nearest neighbor (CABON) is proposed. First, an initial clustering of the data objects is performed to obtain the undetermined-cluster set, defined as the set of data objects whose cluster membership must be checked further. Then, proceeding from the edge to the center of this set, the nearest-neighbor method reassigns each data object in the undetermined-cluster set to the cluster of its nearest neighbor. Meanwhile, the undetermined-cluster set is dynamically adjusted to obtain the final clustering result, which prevents the “uniform effect” from influencing it. The clustering results of the proposed algorithm are compared with those of K–means, the imbalanced K–means clustering method with multiple centers (MC_IK), and the coefficient of variation clustering for non-uniform data (CVCN) on synthetic and real datasets.
The experimental results reveal that the CABON algorithm effectively reduces the “uniform effect” generated by the K–means algorithm on imbalanced data, and its clustering results are superior to those of the K–means, MC_IK, and CVCN algorithms.
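
The reassignment step described above can be illustrated with a minimal Python sketch. It is not the CABON implementation: the abstract does not specify how the undetermined-cluster set is built or how "edge to center" is measured, so the function name `reassign_undetermined`, the hand-picked undetermined set, and the ordering rule (always process the undetermined object closest to an already determined object) are all assumptions made for illustration.

```python
import math


def reassign_undetermined(points, labels, undetermined):
    """Give each undetermined point the label of its nearest determined point.

    points:       list of coordinate tuples
    labels:       initial cluster labels (e.g. from K-means)
    undetermined: set of indices to re-check (hypothetical input; the
                  abstract does not say how this set is constructed)
    """
    labels = list(labels)  # work on a copy so the input is not modified
    pending = set(undetermined)
    determined = [i for i in range(len(points)) if i not in pending]
    if pending and not determined:
        raise ValueError("at least one determined point is required")

    while pending:
        # Assumed "edge to center" order: pick the pending point that lies
        # closest to any determined point, i.e. on the edge of the set.
        _, u, d = min(
            (math.dist(points[u], points[d]), u, d)
            for u in pending
            for d in determined
        )
        labels[u] = labels[d]
        # Dynamic adjustment: the point leaves the undetermined set and can
        # serve as a nearest neighbor for the remaining pending points.
        pending.remove(u)
        determined.append(u)
    return labels


# Toy imbalanced data: a large cluster (x = 0..5) and a small one (x = 9, 10).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
          (4.0, 0.0), (5.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
# Initial labels showing the "uniform effect": the small cluster has
# swallowed the points at x = 4 and x = 5.
initial = [0, 0, 0, 0, 1, 1, 1, 1]
result = reassign_undetermined(points, initial, undetermined={3, 4, 5})
print(result)  # [0, 0, 0, 0, 0, 0, 1, 1]
```

In this toy case the points at x = 4 and x = 5 return to the large cluster because each one, once reached from the edge, is nearer to a large-cluster object than to any small-cluster object.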