Title :
A New Text Categorization Technique Using Distributional Clustering and Learning Logic
Author :
Al-Mubaid, Hisham ; Umair, Syed A.
Author_Institution :
Houston Univ., TX
Abstract :
Text categorization remains one of the most researched NLP problems because of the ever-increasing volume of electronic documents and digital libraries. In this paper, we present a new text categorization method that combines distributional clustering of words with a learning logic technique, called Lsquare, for constructing text classifiers. The high dimensionality of text documents hampers categorization, and feature clustering has been shown to be an effective alternative to feature selection for reducing dimensionality. We therefore use a distributional clustering method (IB) to generate an efficient representation of documents and apply Lsquare to train text classifiers. The method was extensively tested and evaluated. On three benchmark data sets (WebKB, 20Newsgroup, and Reuters-21578), with a small number of training documents and under the same experimental settings, the proposed method achieves classification accuracy and F1 results that are higher than or comparable to those of SVM. The results indicate that the method is a good choice for applications with a limited amount of labeled training data. We also demonstrate the effect of changing the training set size on the classification performance of the learners.
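As a rough illustration of the two ideas the abstract relies on, the sketch below maps a tokenized document to a vector of word-cluster counts and computes the F1 measure for one class. It is a minimal sketch under stated assumptions, not the authors' implementation: the word-to-cluster assignment is supplied by hand rather than learned by the IB method, no Lsquare training is shown, and the names `cluster_features`, `f1_score`, and `word_to_cluster` are hypothetical.

```python
from collections import Counter


def cluster_features(tokens, word_to_cluster, num_clusters):
    """Represent a document as counts over word clusters instead of words.

    `word_to_cluster` is a hypothetical, precomputed mapping from a word to
    its cluster index; words missing from the mapping are ignored.
    """
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    return [counts.get(c, 0) for c in range(num_clusters)]


def f1_score(true_labels, predicted_labels, positive):
    """Compute F1 (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: four words grouped into two hand-made clusters.
word_to_cluster = {"goal": 0, "match": 0, "stock": 1, "market": 1}
doc = ["goal", "match", "market", "the"]
features = cluster_features(doc, word_to_cluster, num_clusters=2)
```

With the clusters fixed, the feature vector has one entry per cluster rather than one per vocabulary word, which is the dimensionality reduction the abstract refers to.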
Keywords :
classification; digital libraries; learning (artificial intelligence); pattern clustering; text analysis; word processing; Lsquare learning logic technique; NLP problem; SVM; digital library; dimensionality reduction; distributional word clustering method; document representation; electronic document; feature clustering; feature selection; machine learning; text categorization technique; text classifier; Benchmark testing; Data mining; Electronic mail; Logic testing; Machine learning; Machine learning algorithms; Support vector machine classification; Support vector machines; Text categorization; Training data
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
DOI :
10.1109/TKDE.2006.135