DocumentCode
2971764
Title
An effective method to recognize the language of a text in a collection of multilingual documents
Author
Kadri, Said ; Moussaoui, Abdelouahab
Author_Institution
Dept. of ICST, Univ. of M'sila, M'sila, Algeria
fYear
2013
fDate
7-9 Nov. 2013
Firstpage
208
Lastpage
211
Abstract
Identifying the language of a text means assigning the text to the language in which it is written. This identification has become important because of the growing diversity of textual data in different languages on the web. Reliable recognition of a text's language is not possible if the word is the only basic unit of information considered: this may work for some languages but is very difficult for others. Segmenting the text into characteristic n-grams is a very efficient alternative. It has also become a preferred tool in language acquisition and in the extraction of knowledge from texts. In this paper, we present the best-known identification methods and propose a new method based on character n-grams. We also compare the results obtained with those of other methods under two approaches: segmentation into words and segmentation into n-grams.
Keywords
Internet; character recognition; data mining; natural language processing; text analysis; World Wide Web; characteristic n-grams; knowledge extraction; language acquisition; multilingual documents; text language; textual data; Distance measurement; Educational institutions; Pragmatics; Probability; Text categorization; Text recognition; Training; N-grams; language identification; machine learning; text categorization; text mining;
fLanguage
English
Publisher
IEEE
Conference_Titel
Electronics, Computer and Computation (ICECCO), 2013 International Conference on
Conference_Location
Ankara
Type
conf
DOI
10.1109/ICECCO.2013.6718265
Filename
6718265
Link To Document