DocumentCode :
3576420
Title :
Toward robust classification using the Open Directory Project
Author :
JongWoo Ha ; Jung-Hyun Lee ; Won-Jun Jang ; Yong-Ku Lee ; Sangkeun Lee
Author_Institution :
Korea Univ., Seoul, South Korea
fYear :
2014
Firstpage :
607
Lastpage :
612
Abstract :
The Open Directory Project (ODP) is a large-scale, high-quality, publicly available web directory used in many studies and real-world applications. In this paper, we explore training data expansion techniques for text classification as one possible way to deal with the sparsity of the ODP dataset. We propose a dozen classification methods that differ in (1) the categories from which training data is expanded and (2) how the expanded training data is merged to generate centroid vectors. Evaluation results show that training data expansion significantly improves classification performance over representative classifiers. We also find that (1) child and descendant categories are more valuable sources for expanding training data than parent and ancestor categories, and (2) distance-based weighting is superior to simple averaging for merging the expanded training data.
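The merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the weight 1 / (1 + distance), the function names `merge_centroid` and `cosine`, and the toy term vectors are all assumptions made for illustration. Distance here means the number of taxonomy edges between the target category and the category the training data is borrowed from.

```python
import math
from collections import Counter


def merge_centroid(sources, weighting="distance"):
    """Merge term vectors from several categories into one centroid.

    sources: list of (term_vector, distance) pairs; distance is the number
    of taxonomy edges from the target category (0 = the target itself).
    weighting: "distance" (assumed 1 / (1 + distance)) or "average".
    """
    centroid = Counter()
    total_weight = 0.0
    for vector, distance in sources:
        # Hypothetical weighting scheme; the paper may use a different one.
        weight = 1.0 / (1 + distance) if weighting == "distance" else 1.0
        for term, value in vector.items():
            centroid[term] += weight * value
        total_weight += weight
    return {term: value / total_weight for term, value in centroid.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data: the target category plus one child category (distance 1).
target = Counter({"python": 2})
child = Counter({"django": 2})
weighted = merge_centroid([(target, 0), (child, 1)])
averaged = merge_centroid([(target, 0), (child, 1)], weighting="average")
```

With distance-based weighting the child category contributes less to the centroid than the target's own documents, whereas simple averaging treats both equally. A document would then be assigned to the category whose centroid has the highest `cosine` similarity to it.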
Keywords :
Internet; merging; pattern classification; text analysis; ODP dataset; Open Directory Project; ancestor categories; categories training data; centroid vectors; child categories; data merging; descendant categories; distance-based weighting; parent categories; publicly available Web directory; representative classifiers; simple averaging; text classification; training data expansion techniques; Niobium; Support vector machines; Taxonomy; Training; Training data; Vectors; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Science and Advanced Analytics (DSAA), 2014 International Conference on
Type :
conf
DOI :
10.1109/DSAA.2014.7058134
Filename :
7058134