• DocumentCode
    2042101
  • Title

    A feature selection method for document clustering based on part-of-speech and word co-occurrence

  • Author

    Liu, Zitao ; Yu, Wenchao ; Deng, Yalan ; Wang, Yongtao ; Bian, Zhiqi

  • Author_Institution
    Int. Sch. of Software, Wuhan Univ., Wuhan, China
  • Volume
    5
  • fYear
    2010
  • fDate
    10-12 Aug. 2010
  • Firstpage
    2331
  • Lastpage
    2334
  • Abstract
    Feature selection is a process that chooses a subset of the original feature set according to some rules. The selected features retain their original physical meaning and provide a better understanding of the data and the learning process. However, few modern feature selection approaches take advantage of the features' context information. Based on this analysis, we propose a novel feature selection method based on part-of-speech and word co-occurrence. According to the components of Chinese document text, we use the words' part-of-speech attributes to filter out many meaningless terms. We then define co-occurrence words by their part-of-speech and use them to select features. For evaluation, we run experiments on the text corpus from Sogou Lab and use Entropy and Precision as criteria for an objective evaluation of document clustering performance. The results show that our method selects better features and achieves better clustering performance.
  • Keywords
    feature extraction; pattern clustering; speech synthesis; text analysis; unsupervised learning; word processing; Chinese document; Sogou lab; context information; document clustering; feature selection method; learning process; part of speech; text corpus; word co-occurrence; Context; Educational institutions; Entropy; Feature extraction; Machine learning; Software; Speech; document clustering; feature selection; part-of-speech; word co-occurrence;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on
  • Conference_Location
    Yantai, Shandong
  • Print_ISBN
    978-1-4244-5931-5
  • Type

    conf

  • DOI
    10.1109/FSKD.2010.5569827
  • Filename
    5569827