• DocumentCode
    3075157
  • Title
    A New Approach for Clustering Variable Length Documents
  • Author
    Kumar, Niraj; Srinathan, Kannan
  • Author_Institution
    IIIT, Hyderabad
  • fYear
    2009
  • fDate
    6-7 March 2009
  • Firstpage
    982
  • Lastpage
    987
  • Abstract
    This paper proposes a method for clustering documents of variable length. The main idea is to (a) automatically identify 1-, 2-, and 3-grams (to reduce the dependency on extensive background vocabulary support, learning, or complex probabilistic approaches), (b) order them by a relevance measure developed with the help of Tf-Idf and term-weighting approaches, and finally (c) use them, instead of a bag-of-words approach, to create a vector space model and apply known clustering methods, i.e., Bisecting K-means, K-means, hierarchical clustering (single link), and a graph-based method. Our experimental results on publicly available text datasets (Cogprints and NewsGroup20) show remarkable improvements in the performance of these clustering algorithms with this new approach.
  • Keywords
    document handling; learning (artificial intelligence); pattern clustering; vocabulary; K-means clustering; automatic identification; background vocabulary support; complex probabilistic approach; learning; term-weighting approach; variable length documents clustering; Classification tree analysis; Clustering algorithms; Clustering methods; Extraterrestrial measurements; Partitioning algorithms; Vocabulary; Bisecting K-means; Document clustering; K-means; Vector Space Model; hierarchical methods
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advance Computing Conference, 2009. IACC 2009. IEEE International
  • Conference_Location
    Patiala
  • Print_ISBN
    978-1-4244-2927-1
  • Electronic_ISBN
    978-1-4244-2928-8
  • Type
    conf
  • DOI
    10.1109/IADCC.2009.4809148
  • Filename
    4809148
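
The three steps outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the relevance measure, so plain smoothed TF-IDF stands in for it here, and a basic K-means with cosine similarity stands in for the clustering methods listed. All function names, the `top_k` cutoff, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, max_n=3):
    """Step (a): return every 1- to max_n-gram of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def normalize(vec):
    """Scale a sparse vector (a dict of feature -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs, top_k=50):
    """Steps (b) and (c): rank n-grams by smoothed TF-IDF, keep the top_k per document."""
    counts = [Counter(ngrams(d)) for d in docs]
    doc_freq = Counter(g for c in counts for g in c)  # documents containing each n-gram
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        weights = {g: (tf / total) * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
                   for g, tf in c.items()}
        # Assumption: plain TF-IDF replaces the paper's unspecified relevance measure.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        vectors.append(normalize(dict(ranked)))
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, which equals their cosine."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def kmeans(vectors, k, iters=10):
    """Basic K-means with cosine similarity; requires k <= len(vectors)."""
    centroids = [dict(v) for v in vectors[:k]]  # deterministic seed: the first k documents
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        # Recompute each centroid as the normalized sum of its members.
        for j in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == j:
                    for g, w in v.items():
                        total[g] += w
            if total:
                centroids[j] = normalize(total)
    return labels


docs = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "a cat sat on a mat",
    "stock markets rose sharply today",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

Because shared bigrams and trigrams such as "cat sat on" become features, documents of different lengths can still be compared on phrase overlap rather than on single words alone. Seeding the centroids from the first `k` documents keeps the sketch deterministic; a real run would likely use a more robust initialization.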