• DocumentCode
    3776815
  • Title

    Frequent term based text document clustering: A new approach

  • Author

    Manoj Kumar;D K Yadav;Vijay Kumar Gupta

  • Author_Institution
    Department of Information and Technology, BBDNITM, Lucknow, India
  • fYear
    2015
  • Firstpage
    11
  • Lastpage
    15
  • Abstract
    Document clustering organizes documents into groups. The Vector Space Model (VSM) represents each document as a vector, which makes the documents easier to cluster. The main problem in text document clustering is the very high dimensionality of the data, since each term in a document represents a dimension. To reduce the dimensionality of the document vector space, the documents are preprocessed; the main techniques involved are stemming and term filtering. After dimensionality reduction, a term frequency vector is generated for each document, where each cell holds the frequency of one term. Using the method proposed in the paper, each pair of term frequency vectors is compared to obtain a similarity value for the corresponding pair of documents. In this way, three similarity matrices, minimum_match, maximum_match and average_match, are generated and then used with various clustering techniques to produce clusters. Clusters produced with the proposed approach are compared, in terms of F-measure, with clusters produced using cosine similarity. The higher F-measure values obtained with the proposed method indicate that the proposed algorithm performs better.
  • Keywords
    "Clustering algorithms","Data mining","Filtering","Computer science","Standards","Time-frequency analysis","Semantics"
  • Publisher
    IEEE
  • Conference_Titel
    Soft Computing Techniques and Implementations (ICSCTI), 2015 International Conference on
  • Type

    conf

  • DOI
    10.1109/ICSCTI.2015.7489630
  • Filename
    7489630
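
The abstract uses cosine similarity over term frequency vectors as the baseline for comparison. The following is a minimal sketch of that baseline only, not of the paper's minimum_match, maximum_match or average_match measures, which the abstract does not define. The stop-word list, the whitespace tokenizer, and the function names are assumptions made for illustration; stemming is omitted.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}


def term_frequencies(text):
    """Build a term frequency vector (term -> count) after simple term filtering."""
    tokens = [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]
    return Counter(tokens)


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term frequency vectors."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Pairwise cosine similarity between every two documents."""
    vectors = [term_frequencies(d) for d in docs]
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]


docs = ["cats chase mice", "cats chase dogs", "stocks fall sharply"]
matrix = similarity_matrix(docs)
```

A matrix of this kind can be passed to any clustering technique that accepts pairwise similarities; documents sharing no terms get a similarity of 0.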