A New Approach for Clustering Variable Length Documents

Author

Kumar, Niraj ; Srinathan, Kannan

Author_Institution

IIIT, Hyderabad

fYear

2009

fDate

6-7 March 2009

Firstpage

982

Lastpage

987

Abstract

This paper proposes a method for clustering documents of variable length. The main idea is to (a) automatically identify 1-, 2-, and 3-grams (to reduce the dependency on large background vocabulary support, learning, or complex probabilistic approaches), (b) order them by a measure of relevance developed with the help of Tf-Idf and term-weighting approaches, and finally (c) use them, instead of a bag-of-words approach, to build a vector space model and apply known clustering methods, i.e., Bisecting K-means, K-means, the hierarchical method (single link), and a graph-based method. Our experimental results on publicly available text datasets (Cogprints and NewsGroup20) show remarkable improvements in the performance of these clustering algorithms with this new approach.
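
The three steps in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the relevance measure, so plain length-normalized tf-idf is assumed here, and the function names (`ngrams`, `tfidf_vectors`, `cosine`), the `top_k` cutoff, and the smoothed idf formula are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def ngrams(text, max_n=3):
    """Return all 1-, 2-, and 3-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, top_k=50):
    """Build one tf-idf vector per document over the top_k best-scoring n-grams."""
    counts = [Counter(ngrams(d)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    # Smoothed idf (an assumption; the paper's exact weighting is not given).
    idf = {g: math.log(n_docs / df[g]) + 1.0 for g in df}
    # Term frequency is divided by the document's n-gram count so that
    # long and short documents remain comparable.
    weights = [{g: (c[g] / sum(c.values())) * idf[g] for g in c} for c in counts]
    # Rank n-grams by their best score in any document; keep top_k as features.
    best = Counter()
    for w in weights:
        for g, s in w.items():
            best[g] = max(best[g], s)
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    features = [g for g, _ in ranked[:top_k]]
    vectors = [[w.get(g, 0.0) for g in features] for w in weights]
    return features, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


docs = [
    "k means clustering of text documents",
    "bisecting k means clustering",
    "stock market prices fell sharply today",
]
features, vectors = tfidf_vectors(docs, top_k=1000)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

The resulting vectors could then be passed to any of the clustering methods named in the abstract, such as K-means or Bisecting K-means; in this toy input the first two documents share several n-grams, so `sim_related` exceeds `sim_unrelated`.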

Keywords

document handling; learning (artificial intelligence); pattern clustering; vocabulary; K-means clustering; automatic identification; background vocabulary support; complex probabilistic approach; learning; term-weighting approach; variable length documents clustering; Classification tree analysis; Clustering algorithms; Clustering methods; Extraterrestrial measurements; Partitioning algorithms; Vocabulary; Bisecting K-means; Clustering algorithms; Document clustering; K-means; Vector Space Model; hierarchical methods

fLanguage

English

Publisher

IEEE

Conference_Titel

Advance Computing Conference, 2009. IACC 2009. IEEE International

Conference_Location

Patiala

Print_ISBN

978-1-4244-2927-1

Electronic_ISBN

978-1-4244-2928-8

Type

conf

DOI

10.1109/IADCC.2009.4809148

Filename

4809148