• DocumentCode
    3483407
  • Title
    A SOM-based document clustering using phrases
  • Author
    Bakus, J. ; Hussin, M.F. ; Kamel, M.
  • Author_Institution
    Dept. of Syst. Design Eng., Waterloo Univ., Ont., Canada
  • Volume
    5
  • fYear
    2002
  • fDate
    18-22 Nov. 2002
  • Firstpage
    2212
  • Abstract
    Most existing techniques for document clustering rely on a "bag of words" document representation. Each word in the document is treated as a separate feature, ignoring word order. We investigate the use of phrases rather than words as document features for document clustering. We present a phrase grammar extraction technique, and use the extracted phrases as the features in a self-organizing map based document clustering algorithm. We present clustering results using the REUTERS corpus and show an improvement in clustering performance under both the entropy and F-measure evaluation measures.
  • Keywords
    document handling; natural languages; pattern clustering; self-organising feature maps; F-measure evaluation measures; REUTERS corpus; SOM-based document clustering; bag of words; document features; document representation; phrase grammar extraction technique; self-organizing map; Automatic control; Clustering algorithms; Computer science; Data mining; Entropy; Information retrieval; Internet; Machine learning; Merging; Organizing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Neural Information Processing, 2002. ICONIP '02. Proceedings of the 9th International Conference on
  • Print_ISBN
    981-04-7524-1
  • Type
    conf
  • DOI
    10.1109/ICONIP.2002.1201886
  • Filename
    1201886