DocumentCode :
1653662
Title :
Phoneme Based Representation for Vietnamese Web Page Classification
Author :
Nguyen, Giang-Son ; Gao, Xiaoying ; Andreae, Peter
Author_Institution :
Sch. of Eng. & Comput. Sci., Victoria Univ. of Wellington, Wellington, New Zealand
Volume :
1
fYear :
2011
Firstpage :
15
Lastpage :
22
Abstract :
This paper proposes a novel text representation for Web pages written in Vietnamese. The representation is based on an analysis of Vietnamese documents at the phonetic level, in which each document is represented as a bag of phonemes. It is designed to capture sound-based information in documents and to help resolve several non-topic text classification problems, including automatic identification of Vietnamese-language documents, detection of ancient Vietnamese documents, author identification, and poem identification. We apply typical machine learning methods, including naive Bayes (NB), k-nearest neighbours (KNN), and support vector machines (SVMs), to build text classifiers. In most cases, the experimental results show a significant improvement in both effectiveness and efficiency over the traditional syllable-based representation.
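The bag-of-phonemes idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give their phoneme inventory, so the sketch assumes a simplified orthographic approximation in which each written syllable is split into an onset, a rhyme, and a tone. The names `ONSETS`, `TONE_MARKS`, `split_syllable`, and `bag_of_phonemes`, and the `o:`/`r:`/`t:` feature prefixes, are hypothetical and chosen only for illustration.

```python
import re
import unicodedata
from collections import Counter

# ASSUMPTION: a simplified list of written Vietnamese onsets, longest first,
# so that "ngh" is tried before "ng" and "n". Not taken from the paper.
ONSETS = sorted(
    ["ngh", "ng", "nh", "gh", "gi", "kh", "ph", "th", "tr", "ch", "qu",
     "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r", "s",
     "t", "v", "x"],
    key=len, reverse=True)

# Combining marks that encode tone. Vowel-quality marks (circumflex,
# breve, horn) are deliberately absent so they stay on the vowel.
TONE_MARKS = {
    "\u0300": "huyen",  # grave accent
    "\u0301": "sac",    # acute accent
    "\u0303": "nga",    # tilde
    "\u0309": "hoi",    # hook above
    "\u0323": "nang",   # dot below
}


def split_syllable(syllable):
    """Split one written syllable into (onset, rhyme, tone)."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone = "ngang"  # level tone: no tone mark present
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # Recompose so that vowels such as "ê" or "ơ" are single characters.
    base = unicodedata.normalize("NFC", "".join(kept))
    onset = ""
    for candidate in ONSETS:
        # Require a non-empty rhyme, so "gi" alone splits as "g" + "i".
        if base.startswith(candidate) and len(base) > len(candidate):
            onset = candidate
            break
    return onset, base[len(onset):], tone


def bag_of_phonemes(text):
    """Count onset, rhyme, and tone units over all syllables in text."""
    text = unicodedata.normalize("NFC", text)
    bag = Counter()
    for syllable in re.findall(r"[^\W\d_]+", text):
        onset, rhyme, tone = split_syllable(syllable)
        if onset:
            bag["o:" + onset] += 1
        bag["r:" + rhyme] += 1
        bag["t:" + tone] += 1
    return bag


if __name__ == "__main__":
    print(bag_of_phonemes("Tiếng Việt, tiếng Anh"))
```

Because the set of distinct onsets, rhymes, and tones is far smaller than the set of distinct syllables, the resulting feature vectors are shorter than in a bag-of-syllables model. That is one plausible reading of the efficiency gain the abstract reports, though the paper itself should be consulted for the actual explanation.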
Keywords :
Internet; learning (artificial intelligence); natural language processing; pattern classification; support vector machines; text analysis; KNN; NB; SVM; Vietnamese Web page classification; Vietnamese document analysis; Vietnamese document detection; author identification; automatic Vietnamese language identification; machine learning methods; nontopic text classification problems; phoneme based representation; phonetic level; poem identification; sound-based information; text classifiers; text representation; Machine learning; Niobium; Support vector machines; Text categorization; Training; Vocabulary; Web pages; Classification; Document representation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on
Conference_Location :
Lyon
Print_ISBN :
978-1-4577-1373-6
Electronic_ISBN :
978-0-7695-4513-4
Type :
conf
DOI :
10.1109/WI-IAT.2011.142
Filename :
6040490