Title :
Accurate SVM Text Classification for Highly Skewed Data Using Threshold Tuning and Query-Expansion-Based Feature Selection
Author :
Goertzel, Ben ; Venuto, James
Author_Institution :
Virginia Tech's Nat. Capital Operation, Arlington
Abstract :
A novel technique is described in which Support Vector Machines (SVMs) perform relatively effective text categorization from small numbers of positive examples (fewer than 10 in some cases). In addition to the positive examples, a query describing the positive category is assumed to be given, in the form of a set of key phrases or a sentence. The technique combines two innovations: a way of adjusting the SVM score threshold based on the distribution of scores across the training set, and a method of feature selection that retains only features semantically associated with the content words in the query (according to a word-association database produced by statistical analysis of a parsed corpus). Examples are given on a number of test cases drawn from the Reuters and FBIS news archives.
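The two ideas in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual threshold rule or the contents of its word-association database, so everything here is an assumption made for illustration: the threshold is chosen by maximizing F1 on the training scores (a stand-in, not the authors' rule), and `associations` is a hypothetical dictionary standing in for the word-association database. The function names are also hypothetical.

```python
def select_features(vocabulary, query_words, associations):
    """Keep features that are query words or are associated with one.

    `associations` is a hypothetical stand-in for a word-association
    database: it maps a query word to a list of related words.
    """
    related = {w.lower() for w in query_words}
    for word in list(related):
        related.update(associations.get(word, ()))
    return [f for f in vocabulary if f.lower() in related]


def f1_at(threshold, scores, labels):
    """F1 on the positive class when predicting positive for score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def tune_threshold(scores, labels):
    """Pick a score threshold from the training-score distribution.

    Assumption: candidates are the default SVM threshold 0.0 plus the
    midpoints between neighbouring distinct scores, and the best
    candidate is the one with the highest training F1.
    """
    ordered = sorted(set(scores))
    candidates = [0.0] + [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    return max(candidates, key=lambda t: f1_at(t, scores, labels))


# Toy data (made up): with very few positives, SVM scores for the
# positive examples can all fall below the default threshold of 0.
vocabulary = ["oil", "crude", "barrel", "football", "the"]
associations = {"oil": ["crude", "barrel"]}
kept = select_features(vocabulary, ["oil"], associations)

train_scores = [-1.5, -1.2, -0.9, -0.8, -0.4, -0.3]
train_labels = [0, 0, 0, 0, 1, 1]
threshold = tune_threshold(train_scores, train_labels)
```

On this toy data the default threshold of 0 labels nothing as positive, while the tuned threshold falls between the highest negative score and the lowest positive score. In practice the scores would come from a trained SVM's decision function, and the threshold would preferably be tuned on held-out data rather than on the training set.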
Keywords :
pattern classification; query processing; support vector machines; text analysis; FBIS news archive; Reuters news archive; SVM; feature selection; highly skewed data; query-expansion; semantic association; text categorization; text classification; threshold tuning; training set; Art; Displays; Image classification; Spatial databases; Statistical analysis; Support vector machine classification; Technological innovation; Testing
Conference_Title :
Neural Networks, 2006. IJCNN '06. International Joint Conference on
Conference_Location :
Vancouver, BC
Print_ISBN :
0-7803-9490-9
DOI :
10.1109/IJCNN.2006.246830