Clustering Algorithm Based on Set Dissimilarity for High-Dimensional Categorical Data

Abstract: This paper proposes a clustering algorithm based on set dissimilarity. By defining a set dissimilarity measure and a reduced representation of a set, the algorithm computes the overall dissimilarity of all objects in a set directly, without calculating the distance between each pair of objects. It compresses high-dimensional categorical data substantially without loss of computational accuracy, and obtains the clustering result in a single scan of the data. The time complexity of the algorithm is nearly linear. An example on real data shows that the algorithm is effective.
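The idea of scoring a whole set from a compact summary, rather than from pairwise distances, can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the summary keeps, per attribute, either the value shared by all members or a "differs" marker; the dissimilarity is taken as the fraction of attributes on which members disagree; and the names `SetSummary`, `one_pass_cluster`, and `threshold` are hypothetical. It should not be read as the paper's actual definitions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Sentinel meaning "members of the set disagree on this attribute".
DIFF = object()


@dataclass(frozen=True)
class SetSummary:
    """Hypothetical compact summary of a set of categorical objects."""
    size: int                    # number of objects in the set
    values: Tuple[object, ...]   # shared value per attribute, or DIFF

    @classmethod
    def of(cls, obj: Sequence[str]) -> "SetSummary":
        # A single object agrees with itself on every attribute.
        return cls(1, tuple(obj))

    def merged_with(self, obj: Sequence[str]) -> "SetSummary":
        # Updating the summary costs O(m) for m attributes and never
        # revisits the objects already absorbed into the set.
        merged = tuple(
            v if (v is not DIFF and v == x) else DIFF
            for v, x in zip(self.values, obj)
        )
        return SetSummary(self.size + 1, merged)

    def dissimilarity(self) -> float:
        # Assumed measure: fraction of attributes on which members disagree.
        differing = sum(1 for v in self.values if v is DIFF)
        return differing / len(self.values)


def one_pass_cluster(
    data: Sequence[Sequence[str]], threshold: float
) -> Tuple[List[int], List[SetSummary]]:
    """Assign each object, in one scan, to the cluster whose dissimilarity
    after adding it is smallest and within `threshold`; otherwise start a
    new cluster."""
    summaries: List[SetSummary] = []
    labels: List[int] = []
    for obj in data:
        best_idx, best_summary, best_d = None, None, None
        for i, s in enumerate(summaries):
            candidate = s.merged_with(obj)
            d = candidate.dissimilarity()
            if d <= threshold and (best_d is None or d < best_d):
                best_idx, best_summary, best_d = i, candidate, d
        if best_idx is None:
            summaries.append(SetSummary.of(obj))
            labels.append(len(summaries) - 1)
        else:
            summaries[best_idx] = best_summary
            labels.append(best_idx)
    return labels, summaries


if __name__ == "__main__":
    toy = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "r"),
           ("b", "y", "r"), ("a", "x", "p")]
    labels, _ = one_pass_cluster(toy, threshold=0.34)
    print(labels)
```

Under these assumptions each object is read once and compared only against cluster summaries, so the cost is roughly proportional to the number of objects times the number of clusters times the number of attributes, which is close to linear in the data size when the number of clusters stays small.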