  • DocumentCode
    2866500
  • Title
    A Novel Document Clustering Model Based on Latent Semantic Analysis
  • Author
    Song, Wei ; Park, Soon Cheol
  • Author_Institution
    Chonbuk Nat. Univ. Korea, Jeonju
  • fYear
    2007
  • fDate
    29-31 Oct. 2007
  • Firstpage
    539
  • Lastpage
    542
  • Abstract
    In this paper we propose a document representation model based on latent semantic analysis (LSA) for text clustering. Most classic clustering systems represent each document by a set of index terms, an approach known as the vector space model (VSM). In such a model, documents are encoded as vectors in N-dimensional space, where N is the number of unique terms. However, this representation scales poorly and is computationally expensive. Latent semantic analysis is a promising approach that attempts to construct a latent semantic structure in textual data and can find relevant documents even when they share no common words. Moreover, it reduces the large term-by-document matrix to a smaller one and provides a robust space for clustering. Two clustering algorithms, K-means and a genetic algorithm (GA), are implemented in the LSA space to demonstrate the effectiveness and validity of our text representation model. We use the SSTRESS criterion to analyze the dissimilarity between the original corpus matrix and its approximations of different ranks, and relate it to the performance of the two clustering algorithms. The text clustering results show that GA and K-means applied in the LSA model outperform conventional GA and K-means in the VSM.
  • Keywords
    data structures; genetic algorithms; pattern clustering; statistical analysis; text analysis; K-means clustering; SSTRESS criteria; document clustering model; document representation model; genetic algorithm; latent semantic analysis; term-by-document matrix; text clustering; text representation model; vector space model; Algorithm design and analysis; Clustering algorithms; Computational efficiency; Genetic algorithms; Information analysis; Knowledge engineering; Partitioning algorithms; Performance analysis; Robustness; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Semantics, Knowledge and Grid, Third International Conference on
  • Conference_Location
    Shan Xi
  • Print_ISBN
    0-7695-3007-9
  • Electronic_ISBN
    978-0-7695-3007-9
  • Type
    conf
  • DOI
    10.1109/SKG.2007.154
  • Filename
    4438614
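
The pipeline outlined in the abstract (a term-by-document matrix reduced to a low-rank LSA space, then clustered with K-means) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the toy matrix, the rank k = 2, the function names, and the deterministic farthest-first initialization are all assumptions made for the example, and neither the genetic algorithm nor the SSTRESS analysis is covered.

```python
import numpy as np

def lsa_document_vectors(A, k):
    """Map the documents (columns of A) to k-dimensional LSA coordinates."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values; each row of the result is one document.
    return (np.diag(s[:k]) @ Vt[:k, :]).T

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def kmeans(X, n_clusters, n_iter=20):
    """Plain K-means with farthest-first initialization (an assumption, chosen for determinism)."""
    centers = [X[0]]
    while len(centers) < n_clusters:
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Hypothetical toy corpus: 8 terms (rows) x 6 documents (columns).
# Documents 0-2 use terms 0-3, documents 3-5 use terms 4-7.
# Documents 0 and 2 share no term but are linked through document 1.
A = np.zeros((8, 6))
for doc, terms in enumerate([(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]):
    A[list(terms), doc] = 1.0

docs_lsa = lsa_document_vectors(A, k=2)
labels = kmeans(docs_lsa, n_clusters=2)

print("VSM cosine(doc0, doc2):", cosine(A[:, 0], A[:, 2]))
print("LSA cosine(doc0, doc2):", cosine(docs_lsa[0], docs_lsa[2]))
print("cluster labels:", labels)
```

In this toy setting, documents 0 and 2 have zero cosine similarity in the raw vector space because they share no term, yet they land close together in the rank-2 space, which is the behaviour the abstract attributes to LSA. For a real corpus, a sparse truncated SVD and a library K-means would likely be a better fit than the dense NumPy calls used here.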