DocumentCode :
2243355
Title :
Enhanced co-occurrence distances for categorical data in unsupervised learning
Author :
Feng, Jia-yi ; Wang, Ming-chun ; Wang, Can ; Cao, Long-bing
Author_Institution :
Dept. of Sci., Tianjin Univ. of Technol. & Educ., Tianjin, China
Volume :
4
fYear :
2010
fDate :
11-14 July 2010
Firstpage :
2071
Lastpage :
2078
Abstract :
Distance metrics for categorical data play an important role in unsupervised learning tasks such as clustering, and they strongly affect both learning accuracy and computational complexity. Recently, two co-occurrence methods, Co-occurrence Distance based on Power Set (CDPS) and Co-occurrence Distance based on Universal Set (CDUS), have been proposed to calculate distances between categorical attribute values; by exploiting co-occurrences among attributes, they significantly improve clustering accuracy. However, their computational load is high enough to restrict their application in unsupervised learning. This paper proposes two enhanced co-occurrence approaches, Co-occurrence Distance based on Join Set (CDJS) and Co-occurrence Distance based on Intersection Set (CDIS), which calculate the distance between two values of a categorical attribute by considering its relationships to other attributes. Theoretical analysis shows that CDJS and CDIS are equivalent in accuracy to CDPS and CDUS while significantly reducing computational complexity. Extensive experiments on ten benchmark and real-world data sets show that the proposed approaches are equally accurate but much more efficient than CDPS and CDUS, particularly on large-scale data sets.
Keywords :
pattern clustering; unsupervised learning; categorical attribute values; categorical data; co-occurrence distance metrics; data clustering; intersection set co-occurrence distance method; join set co-occurrence distance method; power set co-occurrence distance method; universal set co-occurrence distance method; unsupervised learning; Accuracy; Computational complexity; Cybernetics; Iterative closest point algorithm; Machine learning; Measurement; Unsupervised learning; Categorical data; Co-occurrence distances; Unsupervised learning; clustering;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Cybernetics (ICMLC), 2010 International Conference on
Conference_Location :
Qingdao
Print_ISBN :
978-1-4244-6526-2
Type :
conf
DOI :
10.1109/ICMLC.2010.5580500
Filename :
5580500