Title :
Scaling Information-Theoretic Text Clustering: A Sampling-based Approximate Method
Author :
Zhexi Xu ; Zhiang Wu ; Jie Cao ; Hengnong Xuan
Author_Institution :
Sch. of Inf. Eng., Nanjing Univ. of Finance & Econ., Nanjing, China
Abstract :
Info-Kmeans, a K-means clustering method that uses KL-divergence as the proximity function, is one of the representative methods in information-theoretic clustering. With the explosive growth of online texts such as online reviews and user-generated content, text data is becoming both sparser and much larger, which poses significant challenges to the effectiveness and efficiency of text clustering. In our prior work, we presented a Summation-bAsed Incremental Learning (SAIL) algorithm, which avoids the zero-feature dilemma of highly sparse texts. In this paper, we propose a sampling-based approximate approach for scaling the SAIL algorithm to large-scale text collections. In particular, instance-level random sampling is used to reduce the number of instances examined during each iteration, which substantially speeds up clustering on big text data. Furthermore, we prove that the margin of error introduced by random sampling can be kept within a small range. Extensive experiments on eight real-life text datasets demonstrate the advantage of the proposed sampling-based approximate clustering method, which shows merits in both effectiveness and efficiency.
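
The idea summarized in the abstract (K-means with KL-divergence as the proximity function, examining only a random sample of instances per iteration) can be illustrated with a minimal sketch. This is not the authors' SAIL algorithm: it is a plain batch variant, and it sidesteps the zero-feature problem with additive smoothing, which is a substitute technique assumed here for simplicity. All names (`sampled_info_kmeans`, `sample_frac`, `eps`) are hypothetical, and each document is assumed to be a term-count vector with at least one nonzero entry.

```python
import math
import random


def normalize(vec):
    """Scale a non-negative vector so that its entries sum to 1."""
    total = sum(vec)
    return [v / total for v in vec]


def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def centroid(members, dim, eps):
    """Smoothed mean distribution of the members.

    Adding eps to every feature keeps all centroid entries positive so the
    KL-divergence stays finite (an assumption of this sketch, not SAIL).
    """
    sums = [eps] * dim
    for m in members:
        for i, v in enumerate(m):
            sums[i] += v
    return normalize(sums)


def sampled_info_kmeans(docs, k, sample_frac=0.5, iters=20, eps=1e-6, seed=0):
    """K-means under KL-divergence, updating centroids from a random sample."""
    rng = random.Random(seed)
    data = [normalize(d) for d in docs]
    dim = len(data[0])
    # Initialise centroids from k randomly chosen documents.
    centroids = [centroid([x], dim, eps) for x in rng.sample(data, k)]
    n_sample = min(len(data), max(k, int(sample_frac * len(data))))
    for _ in range(iters):
        # Instance-level random sampling: only these documents are examined.
        batch = rng.sample(data, n_sample)
        groups = [[] for _ in range(k)]
        for x in batch:
            j = min(range(k), key=lambda c: kl_divergence(x, centroids[c]))
            groups[j].append(x)
        # A cluster that received no sampled instance keeps its old centroid.
        centroids = [
            centroid(g, dim, eps) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    # Final pass: assign every document to its nearest centroid.
    labels = [
        min(range(k), key=lambda c: kl_divergence(x, centroids[c])) for x in data
    ]
    return labels, centroids
```

A call such as `sampled_info_kmeans(docs, k=2, sample_frac=0.5)` returns one cluster label per document together with the final centroids. Lowering `sample_frac` reduces the work per iteration at the cost of noisier centroid updates, which is the trade-off the abstract refers to.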
Keywords :
Big Data; information theory; pattern clustering; sampling methods; text analysis; SAIL algorithm; information-theoretic text clustering; instance-level random sampling; sampling-based approximate clustering method; summation-based incremental learning; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Clustering methods; Indexes; Linear programming; Wireless application protocol; K-means; KL-divergence; Random; Text Clustering;
Conference_Title :
Advanced Cloud and Big Data (CBD), 2014 Second International Conference on
Print_ISBN :
978-1-4799-8086-4
DOI :
10.1109/CBD.2014.56