A New Method of Training Sample Selection in Text Classification

Author

Liao, Yixing ; Pan, Xuezeng

Author_Institution

Dept. of Comput. Sci. & Technol., Zhejiang Univ., Hangzhou, China

Volume

1

fYear

2010

fDate

6-7 March 2010

Firstpage

211

Lastpage

214

Abstract

To address noise samples in the training dataset, this paper proposes a new method for reducing the size of the training dataset in text classification. The method describes the distribution of the training dataset according to the representativeness score of each sample within the class it belongs to, thereby identifying the representative samples and the noise samples in each class. The method is applied to a Chinese text dataset provided by the Fudan Database Center. Experiments show that the proposed method effectively reduces noise samples, improves classification performance, and decreases computational cost.
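
The abstract does not say how the representativeness score is computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes, purely for illustration, that a sample's score is its cosine similarity to the term-frequency centroid of its class, and that the lowest-scoring samples in each class are treated as noise and dropped. The names `tf_vector`, `centroid`, `select_samples`, and the `keep_ratio` parameter are hypothetical.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Return an L2-normalised term-frequency vector as a dict."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def centroid(vectors):
    """Sum the vectors of one class and L2-normalise the result."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}


def cosine(u, v):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def select_samples(samples, keep_ratio=0.8):
    """Keep the most representative samples of each class.

    samples: list of (tokens, label) pairs.
    keep_ratio: assumed fraction of each class to retain (hypothetical knob).
    """
    by_class = {}
    for tokens, label in samples:
        by_class.setdefault(label, []).append(tokens)

    kept = []
    for label, docs in by_class.items():
        vectors = [tf_vector(d) for d in docs]
        class_centroid = centroid(vectors)
        # Assumed score: similarity of each sample to its class centroid.
        scored = sorted(
            zip(docs, vectors),
            key=lambda pair: cosine(pair[1], class_centroid),
            reverse=True,
        )
        n_keep = max(1, math.ceil(keep_ratio * len(docs)))
        kept.extend((doc, label) for doc, _ in scored[:n_keep])
    return kept


if __name__ == "__main__":
    data = [
        (["football", "match", "goal"], "sports"),
        (["football", "goal", "team"], "sports"),
        (["match", "team", "football"], "sports"),
        (["stock", "market", "price"], "sports"),  # mislabelled noise sample
    ]
    for doc, label in select_samples(data, keep_ratio=0.75):
        print(label, doc)
```

In this toy run the mislabelled document shares no terms with the rest of its class, so it receives the lowest score and is the one removed. A real pipeline for Chinese text would also need word segmentation and a weighting scheme, neither of which is specified here.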

Keywords

classification; natural language processing; text analysis; noise samples reduction; text classification; training dataset distribution; training sample selection; Computational efficiency; Computer science; Educational technology; Frequency; Iterative methods; Mutual information; Noise reduction; Paper technology; Probability; Text categorization; representativeness score; training dataset selection

fLanguage

English

Publisher

IEEE

Conference_Titel

Education Technology and Computer Science (ETCS), 2010 Second International Workshop on

Conference_Location

Wuhan

Print_ISBN

978-1-4244-6388-6

Electronic_ISBN

978-1-4244-6389-3

Type

conf

DOI

10.1109/ETCS.2010.621

Filename

5458972