DocumentCode :
2211463
Title :
Clustering categorical data: A stability analysis framework
Author :
Jarman, I.H. ; Etchells, T.A. ; Lisboa, P.J.G. ; Beynon, C.M. ; Martín-Guerrero, J.D.
Author_Institution :
Centre for Public Health, Liverpool John Moores Univ., Liverpool, UK
fYear :
2011
fDate :
11-15 April 2011
Firstpage :
58
Lastpage :
65
Abstract :
Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but it is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which makes it difficult to select the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with 'noisy' data, since the calculation of the mode lacks the smoothing effect inherent in the calculation of the mean. Such noise is common in real-world datasets, for instance in the domain of Public Health, resulting in solutions that can differ radically depending on the initialization and therefore lead to different interpretations. This paper presents two methodologies. The first addresses sensitivity to initialization using a generic landscape mapping of k-modes solutions. The second uses the landscape map to stabilize the partition clusters for discrete data by drawing a consensus sample in order to separate signal from noise components. Results are presented for the benchmark soybean disease dataset, an artificially generated dataset and a case study involving Public Health data.
Keywords :
pattern clustering; stability; categorical data clustering; generic landscape mapping; k-means algorithm; k-modes algorithm; partition clustering method; public health; stability analysis framework; Algorithm design and analysis; Clustering algorithms; Diseases; Frequency measurement; Noise; Noise measurement; Partitioning algorithms; categorical data; clustering; data mining
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computational Intelligence and Data Mining (CIDM), 2011 IEEE Symposium on
Conference_Location :
Paris
Print_ISBN :
978-1-4244-9926-7
Type :
conf
DOI :
10.1109/CIDM.2011.5949452
Filename :
5949452