DocumentCode :
639790
Title :
Performance comparison study of language identification tools for identification of Farsi web pages
Author :
Kordestanchi, Hamed ; Naderi, Habib
Author_Institution :
Mobin Inf. Technol. Res. Center (MITRC), Imam Hossein Comprehensive Univ., Tehran, Iran
fYear :
2013
fDate :
28-30 May 2013
Firstpage :
489
Lastpage :
494
Abstract :
With more and more textual information making its way online, the importance of web page language identification has become more apparent. Various methods of language identification for text have been proposed, and different tools have been built on these methods. However, no comparison study of the performance of these tools on web pages has been reported yet. Therefore, we decided to evaluate the available tools for web page language identification. Our primary target language is Farsi (Persian), but for greater precision and a deeper study we also included Arabic, Urdu and English. For this purpose, we gathered the required information and built two data sets: 'Clean Web', representing clean, noise-free web pages, and 'Ordinary Web', representing the normal noisy web pages usually encountered in everyday browsing. We evaluated the data sets and tools against the chosen measures of accuracy and time. Finally, we report the best tools evaluated and discuss the strong effect of noise on their behavior. The results of this study can be applied in many areas, such as bootstrapping automatic web page translation, ontology extraction, topic classification, language-specific web crawling...
Keywords :
Internet; Web sites; natural language processing; ontologies (artificial intelligence); text analysis; Arabic languages; English languages; Farsi Web page identification; Urdu languages; Web page language identification; bootstrapping automatic Web page translation; clean Web; language identification tools; language-specific Web crawling; noise-free Web pages; ontology extraction; ordinary Web; performance comparison study; textual information; topic classification; Accuracy; Libraries; Noise; Noise measurement; Portals; Time measurement; Web pages; Farsi (Persian); language identification; language profile; n-gram; noise; web page;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information and Knowledge Technology (IKT), 2013 5th Conference on
Conference_Location :
Shiraz
Print_ISBN :
978-1-4673-6489-8
Type :
conf
DOI :
10.1109/IKT.2013.6620118
Filename :
6620118