Abstract:
The Density Peaks Clustering (DPC) algorithm is a simple and efficient clustering algorithm that has attracted wide attention for its ability to identify clusters intuitively and quickly. However, DPC's definition of local density does not account for density differences between clusters, which can easily lead to the incorrect selection of cluster centers. In addition, its chain assignment strategy can propagate a single mislabeled sample into cascading assignment errors, and the algorithm must compute Euclidean distances between all pairs of samples, resulting in high time complexity. To address these issues, this paper proposes a Density Peaks Clustering algorithm with circle-division sampling and reverse nearest neighbors. The proposed algorithm employs circular grid sampling to obtain representative samples, reducing the number of samples to be processed and thus the computational overhead. It introduces an approximate K-nearest-neighbor strategy to strengthen the connection between representatives and the original samples, minimizing the loss of clustering accuracy caused by sampling. Reverse nearest neighbors are used to refine the definition of local density, adjusting each sample's local density so that density peaks are identified accurately. By measuring similarity through shared reverse nearest neighbors and assigning representatives according to the resulting similarity matrix, the algorithm avoids the cascading errors introduced by chain assignment. The algorithm was evaluated in a series of experiments on synthetic datasets with complex shapes, real-world datasets, and larger-scale datasets. The results indicate that the proposed algorithm offers clear clustering advantages on complex-shaped, real-world, and large-scale datasets, with notable improvements in accuracy and efficiency over the DPC algorithm and its improved variants.
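The abstract leaves the precise formulas to the body of the paper. The Python sketch below only illustrates the two ingredients it names, a reverse-nearest-neighbor-adjusted local density and a shared-reverse-nearest-neighbor similarity, under assumed forms: the Gaussian kernel, the scaling by reverse-neighbor count, and all function names here are illustrative choices, not the paper's definitions, and the brute-force distance computation stands in for the circle-division sampling that the proposed method uses to avoid it.

```python
import numpy as np

def knn_and_rnn(X, k):
    """k nearest neighbors and reverse nearest neighbors of every sample.
    Brute-force pairwise distances are used for clarity; the proposed
    method instead works on circle-division representatives."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]          # indices of each sample's k-NN
    rnn = [[] for _ in range(len(X))]
    for i, neighbors in enumerate(knn):
        for j in neighbors:
            rnn[j].append(i)                    # sample i counts j among its k-NN
    return d, knn, rnn

def rnn_local_density(X, k=8):
    """Illustrative RNN-adjusted local density (assumed form): a Gaussian
    kernel over each sample's k-NN distances, scaled by its reverse-neighbor
    count, so peaks in sparse clusters are not masked by denser neighbors."""
    d, knn, rnn = knn_and_rnn(X, k)
    knn_d = np.take_along_axis(d, knn, axis=1)  # distances to the k-NN
    base = np.exp(-(knn_d ** 2)).sum(axis=1)
    return base * np.array([len(r) for r in rnn])

def shared_rnn_similarity(rnn):
    """Illustrative similarity between samples: the number of reverse
    nearest neighbors they share (a simplified, assumed form)."""
    n = len(rnn)
    sets = [set(r) for r in rnn]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = len(sets[i] & sets[j])
    return sim
```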