DocumentCode
589261
Title
Scalable Overlapping Co-clustering of Word-Document Data
Author
Franca, F.O.D.
Author_Institution
Center of Math., Comput. & Cognition (CMCC), Fed. Univ. of ABC (UFABC), Santo Andre, Brazil
Volume
1
fYear
2012
fDate
12-15 Dec. 2012
Firstpage
464
Lastpage
467
Abstract
Text clustering is used in a variety of applications such as content-based recommendation, categorization, summarization, information retrieval and automatic topic extraction. Since most pairs of documents share only a small percentage of words, the dataset representation tends to be very sparse, which calls for a similarity metric capable of partial matching over a subset of features. The technique known as co-clustering is capable of finding several clusters inside a dataset, with each cluster composed of just a subset of the object and feature sets. In word-document data this can be useful to identify clusters of documents pertaining to the same topic, even though they share just a small fraction of words. In this paper a scalable co-clustering algorithm is proposed that uses the locality-sensitive hashing technique to find co-clusters of documents. The proposed algorithm is tested against other co-clustering and traditional algorithms on well-known datasets. The results show that this algorithm is capable of finding clusters more accurately than other approaches while maintaining linear complexity.
Keywords
data structures; pattern clustering; text analysis; dataset representation; locality-sensitive hashing technique; scalable overlapping coclustering; text clustering; word-document data clustering; Accuracy; Clustering algorithms; Complexity theory; Feature extraction; Machine learning; Mutual information; Text mining; co-clustering; hashing
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Applications (ICMLA), 2012 11th International Conference on
Conference_Location
Boca Raton, FL
Print_ISBN
978-1-4673-4651-1
Type
conf
DOI
10.1109/ICMLA.2012.84
Filename
6406666