Regional Information Center for Science and Technology - Semi-supervised text categorization with only a few positive and unlabeled documents

DocumentCode :

3146694

Title :

Semi-supervised text categorization with only a few positive and unlabeled documents

Author :

Lu, Fang ; Bai, Qingyuan

Author_Institution :

Coll. of Math. & Comput. Sci., Fuzhou Univ., Fuzhou, China

Volume :

fYear :

2010

fDate :

16-18 Oct. 2010

Firstpage :

3075

Lastpage :

3079

Abstract :

This paper studies a special case of semi-supervised text categorization: building a text classifier from only a set P of labeled positive documents from one class (the positive class) and a set U of many unlabeled documents drawn from both the positive class and other diverse classes (collectively, the negative class). This setting is called positive and unlabeled learning (PU-Learning). Although effective PU-Learning methods exist, they do not perform well when the labeled positive documents are very few. In this paper, we propose a refined PU-Learning method based on the known technique that combines the Rocchio and K-means algorithms. Because the set P may be very small (≤5%), we not only extract more reliable negative documents from U but also enlarge P by extracting some of the most reliable positive documents from U. Our experimental results show that the refined method can perform better when the set P is very small.
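The abstract gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes documents are already term-weight vectors, uses the standard Rocchio prototype formula with the commonly used weights alpha = 16 and beta = 4, and omits the K-means refinement step entirely. The function name `rocchio_pu_split`, the parameter `n_extra_pos`, and the rule for choosing extra positives (largest similarity margin) are hypothetical choices made for illustration.

```python
import math

def unit(v):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def mean(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rocchio_pu_split(P, U, alpha=16.0, beta=4.0, n_extra_pos=1):
    """Return (reliable_negatives, extra_positives) as index lists into U.

    Assumption: alpha/beta are conventional Rocchio weights, and the
    n_extra_pos selection rule is illustrative, not taken from the paper.
    """
    P = [unit(d) for d in P]
    U = [unit(d) for d in U]
    mean_p, mean_u = mean(P), mean(U)
    # Rocchio prototypes, treating all of U as provisional negatives.
    c_pos = unit([alpha * p - beta * u for p, u in zip(mean_p, mean_u)])
    c_neg = unit([alpha * u - beta * p for p, u in zip(mean_p, mean_u)])
    # Margin > 0: closer (by cosine) to the positive prototype.
    margins = [dot(d, c_pos) - dot(d, c_neg) for d in U]
    reliable_neg = [i for i, m in enumerate(margins) if m < 0]
    # Hypothetical enlargement of P: keep the highest-margin documents.
    candidates = sorted((i for i, m in enumerate(margins) if m > 0),
                        key=lambda i: -margins[i])
    return reliable_neg, candidates[:n_extra_pos]

# Toy data: term 0 marks the positive class; U[0] is a hidden positive.
P = [[3, 0, 1, 0], [2, 0, 0, 0]]
U = [[4, 0, 0, 0], [0, 3, 1, 0], [0, 0, 2, 3], [0, 2, 0, 2]]
reliable_neg, extra_pos = rocchio_pu_split(P, U)
print(reliable_neg, extra_pos)
```

On this toy input the three unlabeled documents that share no vocabulary with P's dominant term end up as reliable negatives, and the hidden positive is the one proposed for addition to P. A final classifier would then be trained on the enlarged P and the reliable negatives.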

Keywords :

learning (artificial intelligence); text analysis; K-means algorithm; PU-learning; labeled positive documents; positive class; semisupervised text categorization; unlabeled documents; unlabeled learning; Classification algorithms; Clustering algorithms; Prototypes; Support vector machines; Text categorization; Training; Web pages; cluster; semi-supervised learning; text categorization;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Biomedical Engineering and Informatics (BMEI), 2010 3rd International Conference on

Conference_Location :

Yantai

Print_ISBN :

978-1-4244-6495-1

Type :

conf

DOI :

10.1109/BMEI.2010.5639749

Filename :

5639749

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3146694