• DocumentCode
    423337
  • Title
    A feature selection algorithm for document clustering based on word co-occurrence frequency
  • Author
    Liu, Yuan-Chao ; Wang, Xiao-long ; Liu, Bing-quan
  • Author_Institution
    Sch. of Comput. Sci. & Technol., Harbin Inst. of Technol., China
  • Volume
    5
  • fYear
    2004
  • fDate
    26-29 Aug. 2004
  • Firstpage
    2963
  • Abstract
    Constructing the feature space from only the more informative words can greatly speed up a document clustering algorithm without affecting cluster quality. This paper first discusses the impact of feature selection on document clustering, then proposes a new feature selection method based on word co-occurrence frequency. According to the cluster hypothesis, documents from the same class are more similar to each other when represented in the vector space model (VSM), so many of the words in these documents tend to occur together. We find these words by word co-occurrence and then construct a reduced feature space for clustering. Experiments show that the selected features are more salient. When documents are clustered in the reduced feature space, run time is greatly shortened while cluster quality is almost unchanged, which makes the clustering algorithm more suitable for practical use.
  • Keywords
    feature extraction; pattern clustering; text analysis; vectors; cluster hypothesis; document clustering algorithm; feature selection algorithm; feature space construction; vector space model; word cooccurrence frequency; Clustering algorithms; Computer science; Explosives; Frequency; Internet; Navigation; Partitioning algorithms; Search engines; Space technology; Unsupervised learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on
  • Print_ISBN
    0-7803-8403-2
  • Type
    conf
  • DOI
    10.1109/ICMLC.2004.1378540
  • Filename
    1378540
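
  • Illustrative sketch
    The abstract describes selecting features by word co-occurrence but does not give the exact scoring rule, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes co-occurrence is counted at the document level (two words co-occur when they appear in the same document) and that a word is kept if it belongs to at least one pair seen in `min_cooccurrence` or more documents. The function name `select_features`, the parameter `min_cooccurrence`, and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def select_features(docs, min_cooccurrence=2):
    """Keep words that appear in at least one frequently co-occurring pair.

    Assumption (not from the paper): a pair co-occurs once per document
    that contains both words, regardless of position or repetition.
    """
    pair_counts = Counter()
    for doc in docs:
        # Deduplicate and sort so each unordered pair has one canonical form.
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))

    selected = set()
    for (first, second), count in pair_counts.items():
        if count >= min_cooccurrence:
            selected.update((first, second))
    return sorted(selected)


# Hypothetical toy corpus with two topics.
docs = [
    "stock market price rises",
    "stock market falls",
    "football match result",
    "football match tonight",
]
features = select_features(docs)
print(features)
```

    On this toy corpus only the pairs (market, stock) and (football, match) occur in two documents, so the reduced feature space keeps those four words and drops the rest. Counting all pairs is quadratic in the number of distinct words per document, which a practical implementation would likely need to bound.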