圆形网格抽样和逆近邻优化的密度峰值聚类算法

Density peaks clustering algorithm with circle-division sampling and reverse nearest neighbor optimization

  • 摘要: 密度峰值聚类(DPC)算法是一种简单高效的聚类算法,因其可直观和快速发现数据集中的类簇而得到广泛关注。但DPC算法局部密度定义未考虑类簇间密度差异影响,易误选类簇中心;使用链式分配策略,易产生错误连带效应;需计算所有样本间的欧氏距离,算法的时间复杂度较高。因此,本文提出一种圆形网格抽样和逆近邻优化的密度峰值聚类算法。该算法采用圆形网格抽样得到代表以减少需要计算的样本数,降低算法计算的时间开销;引入近似 近邻策略加强代表和初始样本的联系,减少抽样导致的聚类精度丢失;利用逆近邻优化局部密度定义策略,调节样本局部密度的大小,准确找到密度峰值;通过共享逆近邻计算相似性,由相似性矩阵分配代表,避免样本分配策略产生的错误连带效应。设置了复杂形态合成数据集、真实数据集和较大规模数据集进行分组实验。实验结果表明,本文算法对复杂形态、真实和较大规模数据集的聚类优势明显,精度及效率相对DPC算法及其改进算法均有较大提升。

     

    Abstract: The Density Peaks Clustering (DPC) algorithm is a simple and efficient clustering algorithm that has garnered wide attention for its ability to intuitively and quickly identify clusters within datasets. However, the DPC algorithm's definition of local density does not account for the impact of density differences between clusters, which can easily lead to the incorrect selection of cluster centers. Additionally, the use of a chain assignment strategy can result in erroneous cascading effects, and the algorithm requires the calculation of Euclidean distances between all samples, leading to high time complexity. To address these issues, this paper proposes a Density Peaks Clustering algorithm with circle-division sampling and reverse nearest neighbors. The proposed algorithm employs circular grid sampling to obtain representative samples, thereby reducing the number of samples that need to be computed and lowering the algorithm's computational time overhead. It introduces an approximate K-nearest neighbor strategy to strengthen the connection between representatives and initial samples, minimizing the loss of clustering accuracy caused by sampling. The algorithm uses reverse nearest neighbors to optimize the definition of local density, adjusting the local density of samples to accurately identify density peaks. By calculating similarity through shared reverse nearest neighbors and assigning representatives based on the similarity matrix, it avoids the erroneous cascading effects caused by sample assignment strategies. The algorithm was tested through a series of experiments on synthetic datasets with complex shapes, real datasets, and larger-scale datasets. Experimental results indicate that the proposed algorithm exhibits significant clustering advantages for complex shapes, real, and large-scale datasets, with notable improvements in accuracy and efficiency compared to the DPC algorithm and its improved versions.

     

/

返回文章
返回