A new framework for uncertainty sampling: exploiting uncertain and positive-certain examples in similarity-based text classification

Author

Lee, Kang H. ; Kang, Byeong H.

Author_Institution

Sch. of Inf. Technol., Sydney Univ., NSW, Australia

Volume

2

fYear

2004

fDate

5-7 April 2004

Firstpage

474

Abstract

One of the major concerns with supervised learning approaches to text classification is that they require a large number of labeled examples to achieve a high level of effectiveness. Labeling so many examples places a considerable burden on human experts. Two common approaches to reducing the number of labeled examples required are: (1) selecting informative uncertain examples for human labeling and (2) using a large amount of inexpensive unlabeled data together with a small number of labeled examples. While previous work in text classification focused on only one of these approaches, we investigate a framework that combines both in similarity-based text classification. By applying our new thresholding strategy (RinSCut) to uncertainty sampling, we propose a framework that automatically selects informative uncertain data to be presented to a human expert for labeling, and positive-certain data that are used directly for learning without human labeling. Experiments were conducted on the Reuters-21578 data set with our similarity-based learning algorithm (KAN). The proposed scheme was compared with random sampling and conventional uncertainty sampling on micro- and macroaveraged F₁. The results suggest that when both macro- and microaveraged measures are of concern, our framework may be the optimal choice.

Keywords

learning by example; pattern classification; text analysis; uncertainty handling; Reuters-21578 data set; RinSCut thresholding strategy; human experts; human-labeling; inexpensive unlabeled data; informative uncertain examples; labeled examples; positive-certain examples; similarity-based learning algorithm; similarity-based text classification; supervised learning; uncertainty sampling; Australia; Humans; Information technology; Labeling; Machine learning; Natural languages; Sampling methods; Supervised learning; Text categorization; Uncertainty;

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on

Print_ISBN

0-7695-2108-8

Type

conf

DOI

10.1109/ITCC.2004.1286699

Filename

1286699