HABOS clustering algorithm for categorical data

  • Abstract: CABOSFV_C (clustering algorithm based on sparse feature vector for categorical attributes) is an efficient clustering algorithm for high-dimensional categorical data. It measures distance with the set-based sparse feature dissimilarity (SFD) and compresses data with sparse feature vectors. Its clustering quality, however, depends on the SFD upper-limit parameter, and no clear guidance exists for choosing that parameter. To address this, the paper proposes a heuristic hierarchical clustering algorithm of categorical data based on sparse feature dissimilarity (HABOS). Following the agglomerative hierarchical clustering approach and constrained by an upper limit on the number of clusters, HABOS uses a new internal clustering validation index based on sparse feature dissimilarity (CVISFD) as a heuristic measure and thereby selects the clustering level automatically. Experiments on three UCI benchmark data sets, comparing HABOS with the traditional algorithms, show that HABOS effectively improves clustering accuracy and stability.
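
The agglomerative idea summarized in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions: the abstract does not give the SFD or CVISFD formulas, so `set_dissimilarity` is a hypothetical stand-in (the share of attributes on which a set of rows disagrees, scaled by set size), and `candidate_levels` is an illustrative name. The sketch only produces the hierarchy levels that satisfy the cluster-count upper limit. The step that scores those levels with a validity index is left out because the abstract does not define that index.

```python
from itertools import combinations


def set_dissimilarity(rows):
    """Hypothetical stand-in for a set-level dissimilarity on categorical rows.

    Counts the attributes on which the rows do not all agree and divides by
    (number of rows * number of attributes). This is an illustrative choice,
    not the SFD definition from the paper.
    """
    n_attrs = len(rows[0])
    differing = sum(1 for j in range(n_attrs) if len({r[j] for r in rows}) > 1)
    return differing / (len(rows) * n_attrs)


def candidate_levels(data, max_clusters):
    """Build an agglomerative hierarchy and keep the levels within the limit.

    Starts with one cluster per row, repeatedly merges the pair of clusters
    whose union has the smallest set dissimilarity, and records every level.
    Returns only the levels with at most max_clusters clusters; a validity
    index (not implemented here) would then choose among them.
    """
    clusters = [[i] for i in range(len(data))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Pick the pair of cluster positions whose merged set is most compact.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: set_dissimilarity(
                [data[i] for i in clusters[p[0]] + clusters[p[1]]]
            ),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        levels.append([list(c) for c in clusters])
    return [level for level in levels if len(level) <= max_clusters]


if __name__ == "__main__":
    toy = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
    for level in candidate_levels(toy, max_clusters=2):
        print(len(level), level)
```

Recording every level, rather than stopping at the first admissible one, keeps the level choice separate from the merging, which is where an index such as CVISFD would plug in.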

     
