Title :
Research on the categorization accuracy of different similarity measures on Chinese texts
Author :
Li, Xiangdong ; Liu, Hangyu ; Jia, Han ; Huang, Li
Author_Institution :
Sch. of Inf. Manage., Wuhan Univ., Wuhan, China
Abstract :
This paper focuses on one of the most intensively studied algorithms, the k-Nearest Neighbor (kNN) algorithm. The purpose is to investigate how different similarity measures perform in kNN on Chinese texts. The two measures we focus on are the cosine value and the Jensen-Shannon divergence. We use both the Sogou corpus, whose data are extracted from the Sohu.com website, and datasets that we have processed from real-world sources. The results of our experiment indicate that the choice of similarity metric significantly affects categorization accuracy.
Keywords :
Web sites; natural language processing; text analysis; Chinese texts; Jensen-Shannon divergence; Sogou; Sohu.com; Web site; categorization accuracy; cosine value; k-nearest neighbor algorithm; similarity measure; Accuracy; Classification algorithms; Entropy; Libraries; Machine learning algorithms; Support vector machine classification; Text categorization; Chinese text categorization; KNN algorithm; Similarity; Sougou Corpus;
Conference_Title :
Business Management and Electronic Information (BMEI), 2011 International Conference on
Conference_Location :
Guangzhou
Print_ISBN :
978-1-61284-108-3
DOI :
10.1109/ICBMEI.2011.5920956