Title :
An effective method to recognize the language of a text in a collection of multilingual documents
Author :
Kadri, Said ; Moussaoui, Abdelouahab
Author_Institution :
Dept. of ICST, Univ. of M'sila, M'sila, Algeria
Abstract :
Identifying the language of a text means assigning the text to the language in which it is written. This identification has become important because of the growing diversity of textual data in different languages on the web. Reliable recognition of a text's language is not possible if we consider only the word as the basic unit of information: this may work for some languages but is very difficult for others. Segmenting the text into characteristic n-grams is a very efficient alternative. It has also become a preferred tool in language acquisition and in the extraction of knowledge from texts. In this paper, we present the best-known identification methods and propose a new method based on character n-grams. We also compare the results obtained with those of other methods under two approaches: segmentation into words and segmentation into n-grams.
Keywords :
Internet; character recognition; data mining; natural language processing; text analysis; World Wide Web; characteristic n-grams; knowledge extraction; language acquisition; multilingual documents; text language; textual data; Distance measurement; Educational institutions; Pragmatics; Probability; Text categorization; Text recognition; Training; N-grams; language identification; machine learning; text categorization; text mining;
Conference_Title :
Electronics, Computer and Computation (ICECCO), 2013 International Conference on
Conference_Location :
Ankara
DOI :
10.1109/ICECCO.2013.6718265