DocumentCode
1047811
Title
A New Text Categorization Technique Using Distributional Clustering and Learning Logic
Author
Al-Mubaid, Hisham ; Umair, Syed A.
Author_Institution
Houston Univ., TX
Volume
18
Issue
9
fYear
2006
Firstpage
1156
Lastpage
1165
Abstract
Text categorization continues to be one of the most researched NLP problems due to the ever-increasing amounts of electronic documents and digital libraries. In this paper, we present a new text categorization method that combines the distributional clustering of words and a learning logic technique, called Lsquare, for constructing text classifiers. The high dimensionality of text documents hinders categorization, which is why feature clustering has been proven to be an ideal alternative to feature selection for reducing the dimensionality. We therefore use a distributional clustering method (IB) to generate an efficient representation of documents and apply Lsquare for training text classifiers. The method was extensively tested and evaluated. The proposed method achieves higher or comparable classification accuracy and F1 results compared with SVM under identical experimental settings with a small number of training documents on three benchmark data sets: WebKB, 20Newsgroup, and Reuters-21578. The results show that the method is a good choice for applications with a limited amount of labeled training data. We also demonstrate the effect of changing the training set size on the classification performance of the learners.
Keywords
classification; digital libraries; learning (artificial intelligence); pattern clustering; text analysis; word processing; Lsquare learning logic technique; NLP problem; SVM; digital library; dimensionality reduction; distributional word clustering method; document representation; electronic document; feature clustering; feature selection; machine learning; text categorization technique; text classifier; Benchmark testing; Data mining; Electronic mail; Logic testing; Machine learning; Machine learning algorithms; Support vector machine classification; Support vector machines; Text categorization; Training data
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2006.135
Filename
1661508