Title :
Document filtering boosted by unlabeled data
Author :
Park, Seong-Bae ; Zhang, Byoung-Tak
Author_Institution :
Artificial Intelligence Lab., Seoul Nat. Univ., South Korea
Abstract :
This paper describes three learning methods for document filtering that use unlabeled data. The proposed methods are based on a committee of classifiers that are trained on a small set of labeled data and then augmented with a large amount of unlabeled data. Taking advantage of unlabeled data significantly reduces the effective number of labeled data needed and increases the filtering accuracy. The use of unlabeled data is important because obtaining labeled data is difficult and time-consuming, while unlabeled data are abundant. For all proposed methods, the experimental results show that accuracy improves by up to 9.2% with only two-thirds as many labeled data as the method that does not use unlabeled data.
Keywords :
document handling; information retrieval; learning (artificial intelligence); AdaBoost method; EM-like method; active sampling method; classifiers; document filtering; labeled data; learning methods; unlabeled data; Artificial intelligence; Bagging; Computer science; Data engineering; Filtering; Filters; Humans; Labeling; Machine learning algorithms; Text processing
Conference_Title :
Industrial Electronics, 2001. Proceedings. ISIE 2001. IEEE International Symposium on
Conference_Location :
Pusan
Print_ISBN :
0-7803-7090-2
DOI :
10.1109/ISIE.2001.931808