• DocumentCode
    1047811
  • Title

    A New Text Categorization Technique Using Distributional Clustering and Learning Logic

  • Author

    Al-Mubaid, Hisham ; Umair, Syed A.

  • Author_Institution
    Houston Univ., TX
  • Volume
    18
  • Issue
    9
  • fYear
    2006
  • Firstpage
    1156
  • Lastpage
    1165
  • Abstract
    Text categorization continues to be one of the most researched NLP problems because of the ever-increasing volume of electronic documents and digital libraries. In this paper, we present a new text categorization method that combines the distributional clustering of words with a learning logic technique, called Lsquare, for constructing text classifiers. The high dimensionality of text hampers categorization, and feature clustering has proven to be an effective alternative to feature selection for reducing that dimensionality. We therefore use a distributional clustering method (IB) to generate an efficient representation of documents and apply Lsquare to train text classifiers. The method was extensively tested and evaluated. It achieves higher or comparable classification accuracy and F1 results compared with SVM under the same experimental settings, using a small number of training documents, on three benchmark data sets: WebKB, 20Newsgroup, and Reuters-21578. The results show that the method is a good choice for applications with a limited amount of labeled training data. We also demonstrate the effect of changing the training-set size on the classification performance of the learners.
  • Keywords
    classification; digital libraries; learning (artificial intelligence); pattern clustering; text analysis; word processing; Lsquare learning logic technique; NLP problem; SVM; digital library; dimensionality reduction; distributional word clustering method; document representation; electronic document; feature clustering; feature selection; machine learning; text categorization technique; text classifier; Benchmark testing; Data mining; Electronic mail; Logic testing; Machine learning; Machine learning algorithms; Support vector machine classification; Support vector machines; Text categorization; Training data; Text categorization; feature selection; machine learning
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2006.135
  • Filename
    1661508