DocumentCode :
2971764
Title :
An effective method to recognize the language of a text in a collection of multilingual documents
Author :
Kadri, Said ; Moussaoui, Abdelouahab
Author_Institution :
Dept. of ICST, Univ. of M'sila, M'sila, Algeria
fYear :
2013
fDate :
7-9 Nov. 2013
Firstpage :
208
Lastpage :
211
Abstract :
Identifying the language of a text means assigning the text to the language in which it is written. This task has become important because of the growing diversity of textual data in different languages on the web. Reliable recognition of a text's language is not possible if the word is the only basic unit of information considered: this works for some languages but is very difficult for others. Segmenting the text into characteristic n-grams is a very efficient alternative. It has also become a preferred tool in language acquisition and in the extraction of knowledge from texts. In this paper, we present the best-known identification methods and propose a new method based on character n-grams. We also compare the results obtained with those of other methods under each of the two approaches: segmentation into words and segmentation into n-grams.
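The abstract does not describe the proposed method in detail, so the following is only a minimal sketch of the general idea of segmenting text into character n-grams and comparing frequency profiles. It uses the well-known rank-order ("out-of-place") profile comparison, not the authors' method. The function names, the trigram size, the profile length of 300, and the two toy training sentences are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the normalized text."""
    # Lowercase, collapse whitespace, and pad with one space on each side
    # so that word boundaries are represented in the n-grams.
    padded = f" {' '.join(text.lower().split())} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def profile(text, n=3, top=300):
    """Map each of the `top` most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text, n))
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles, n=3):
    """Return the language whose profile is closest to the profile of the text."""
    doc = profile(text, n)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Hypothetical toy training data; a real system would use much larger corpora.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "french": "le renard brun saute par dessus le chien paresseux et le chat dort dans la maison",
}
lang_profiles = {lang: profile(sample) for lang, sample in samples.items()}

print(identify("the dog and the cat", lang_profiles))
```

Because the unit of comparison is a short character sequence rather than a word, a sketch like this needs no tokenizer or dictionary, which is one reason n-gram segmentation is attractive for languages where word boundaries are hard to determine.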
Keywords :
Internet; character recognition; data mining; natural language processing; text analysis; World Wide Web; characteristic n-grams; knowledge extraction; language acquisition; multilingual documents; text language; textual data; Distance measurement; Educational institutions; Pragmatics; Probability; Text categorization; Text recognition; Training; N-grams; language identification; machine learning; text categorization; text mining
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Electronics, Computer and Computation (ICECCO), 2013 International Conference on
Conference_Location :
Ankara
Type :
conf
DOI :
10.1109/ICECCO.2013.6718265
Filename :
6718265