• DocumentCode
    3223915
  • Title

    A feature selection for Korean Web document clustering

  • Author

    Park, Heum ; Kim, Young-Gi ; Kwon, Hyuk-Chul

  • Author_Institution
    AI Lab. Dept. of Comput. Sci., Pusan Nat. Univ., South Korea
  • Volume
    3
  • fYear
    2004
  • fDate
    2-6 Nov. 2004
  • Firstpage
    2650
  • Abstract
    This paper presents a comparative study of feature selection methods for Korean Web document clustering. First, we examine how term features and the co-links of Web documents affect clustering performance: we cluster Web documents using native term features, co-links, and both, and compare the results with the originally assigned categories. We then select term features for each category from training documents using χ² (chi-square), information gain (IG), and mutual information (MI), and apply these features to other experimental documents. In addition, we propose a new method, max feature selection, which selects the terms that have the maximum count for a category in each experimental document, assigns each term its χ² (or MI or IG) value in place of its term frequency in the document, and then clusters the documents. In the results, χ² performs better than IG or MI, although the difference appears slight. When max feature selection is applied, however, clustering performance improves notably. Max feature selection is a simple but effective means of feature space reduction and performs strongly for Korean Web document clustering.
  • Keywords
    Internet; document handling; feature extraction; information retrieval; pattern clustering; Korean Web document clustering; feature selection method; feature space reduction; information gain; max feature selection; mutual information; Artificial intelligence; Computer science; Data mining; Explosions; Frequency; Information science; Libraries
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Industrial Electronics Society, 2004. IECON 2004. 30th Annual Conference of IEEE
  • Print_ISBN
    0-7803-8730-9
  • Type

    conf

  • DOI
    10.1109/IECON.2004.1432224
  • Filename
    1432224
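
The abstract names χ² and mutual information as per-category term scores but does not give formulas. As a rough illustration only, the sketch below uses the standard textbook definitions over a 2×2 term/category contingency table; the function names, the count names `a`, `b`, `c`, `d`, and the choice of the pointwise form of MI are assumptions made here, not details taken from the paper.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square score of a term for a category (textbook form, assumed).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as no association.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def pointwise_mi(a, b, c, d):
    """Pointwise mutual information (base 2) between a term and a category."""
    n = a + b + c + d
    if a == 0:
        # The term never occurs in the category.
        return float("-inf")
    return math.log2(a * n / ((a + c) * (a + b)))


if __name__ == "__main__":
    # Hypothetical counts: the term appears in every document of the
    # category and in no document outside it.
    print(chi_square(10, 0, 0, 10))
    print(pointwise_mi(10, 0, 0, 10))
```

Under these definitions a term spread evenly across categories scores zero on both measures, and a term concentrated in one category scores high. That ranking is what a per-category feature selection step would threshold on.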