DocumentCode
2816581
Title
A new framework for uncertainty sampling: exploiting uncertain and positive-certain examples in similarity-based text classification
Author
Lee, Kang H. ; Kang, Byeong H.
Author_Institution
Sch. of Inf. Technol., Sydney Univ., NSW, Australia
Volume
2
fYear
2004
fDate
5-7 April 2004
Firstpage
474
Abstract
One of the major concerns with supervised learning approaches to text classification is that they require a large number of labeled examples to achieve a high level of effectiveness. Labeling so many examples places a considerable burden on human experts. Two common approaches to reducing the number of labeled examples required are: (1) selecting informative uncertain examples for human-labeling and (2) using a large amount of inexpensive unlabeled data together with a small number of labeled examples. While previous work in text classification has focused on only one of these approaches, we investigate a framework that combines both in similarity-based text classification. By applying our new thresholding strategy (RinSCut) to uncertainty sampling, we propose a framework that automatically selects informative uncertain data to be presented to a human expert for labeling, and positive-certain data that are used directly for learning without human-labeling. Experiments were conducted on the Reuters-21578 data set with our similarity-based learning algorithm (KAN). The proposed scheme was compared with random sampling and conventional uncertainty sampling, based on micro- and macroaveraged F1. The results show that when both macro- and microaveraged measures are of concern, our framework may be the best choice.
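The selection step described in the abstract can be pictured with a rough sketch. The abstract does not give the details of RinSCut or KAN, so everything below is an assumption made for illustration: the function name `partition_pool`, the two thresholds `lower` and `upper`, and the rule that scores between them count as "uncertain" while scores above `upper` count as "positive-certain". It is not the authors' implementation.

```python
from typing import List, NamedTuple


class Selection(NamedTuple):
    ask_human: List[int]      # indices of uncertain examples to send to an expert
    auto_positive: List[int]  # indices used as positives without human labeling
    ignored: List[int]        # indices left in the unlabeled pool


def partition_pool(scores: List[float], lower: float, upper: float) -> Selection:
    """Split unlabeled examples by their similarity score for one category.

    Assumed rule (hypothetical, not taken from the paper): a score inside
    [lower, upper] is "uncertain", a score above upper is "positive-certain",
    and a score below lower is left alone.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    ask_human: List[int] = []
    auto_positive: List[int] = []
    ignored: List[int] = []
    for index, score in enumerate(scores):
        if score > upper:
            auto_positive.append(index)   # confident enough to learn from directly
        elif score >= lower:
            ask_human.append(index)       # ambiguous, so a human label is requested
        else:
            ignored.append(index)         # stays unlabeled in the pool
    return Selection(ask_human, auto_positive, ignored)


if __name__ == "__main__":
    # Made-up similarity scores for five unlabeled documents.
    pool_scores = [0.91, 0.55, 0.10, 0.62, 0.80]
    print(partition_pool(pool_scores, lower=0.4, upper=0.7))
```

In this sketch only the examples in `ask_human` cost expert effort, while those in `auto_positive` enlarge the training set for free, which is the combination of the two approaches that the abstract describes.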
Keywords
learning by example; pattern classification; text analysis; uncertainty handling; Reuters-21578 data set; RinSCut thresholding strategy; human experts; human-labeling; inexpensive unlabeled data; informative uncertain examples; labeled examples; positive-certain examples; similarity-based learning algorithm; similarity-based text classification; supervised learning; uncertainty sampling; Australia; Humans; Information technology; Labeling; Machine learning; Natural languages; Sampling methods; Supervised learning; Text categorization; Uncertainty;
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on
Print_ISBN
0-7695-2108-8
Type
conf
DOI
10.1109/ITCC.2004.1286699
Filename
1286699