An effective method to recognize the language of a text in a collection of multilingual documents

Author

Kadri, Said ; Moussaoui, Abdelouahab

Author_Institution

Dept. of ICST, Univ. of M'sila, M'sila, Algeria

fYear

2013

fDate

7-9 Nov. 2013

Firstpage

208

Lastpage

211

Abstract

Identifying the language of a text means assigning the text to the language in which it is written. This task has become important because of the growing diversity of textual data in different languages on the web. Reliable recognition of a text's language is not possible if the word is taken as the only basic unit of information: this may work for some languages but is very difficult for others. Segmenting the text into characteristic n-grams is a very efficient alternative. It has also become a preferred tool in language acquisition and in the extraction of knowledge from texts. In this paper, we present the best-known identification methods and propose a new method based on character n-grams. We also compare the results obtained with those of other methods under two approaches: segmentation into words and segmentation into n-grams.
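
The record does not describe the authors' algorithm, so the following is only a minimal sketch of the general idea of character n-gram segmentation applied to language identification. The function names (char_ngrams, profile, overlap, identify), the choice of trigrams, and the overlap similarity are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    # Collapse runs of whitespace so word boundaries appear as single spaces.
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def profile(text, n=3):
    """Map each character n-gram to its relative frequency in the text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def overlap(p, q):
    """Similarity of two profiles: sum of the shared minimum frequencies.

    This measure is an assumption of the sketch, not the paper's distance.
    """
    return sum(min(v, q.get(g, 0.0)) for g, v in p.items())


def identify(text, samples, n=3):
    """Return the key of the sample text whose profile best matches `text`.

    `samples` is a hypothetical dict mapping a language label to a
    reference text in that language.
    """
    target = profile(text, n)
    return max(samples, key=lambda lang: overlap(target, profile(samples[lang], n)))
```

In such a sketch, each candidate language would be represented by a reference text, and an unknown text would be assigned to the language whose n-gram profile it overlaps with most. Working on characters rather than words avoids the need for a language-specific word segmenter, which is the difficulty the abstract points to.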

Keywords

Internet; character recognition; data mining; natural language processing; text analysis; World Wide Web; characteristic n-grams; knowledge extraction; language acquisition; multilingual documents; text language; textual data; Distance measurement; Educational institutions; Pragmatics; Probability; Text categorization; Text recognition; Training; N-grams; language identification; machine learning; text categorization; text mining;

fLanguage

English

Publisher

IEEE

Conference_Titel

Electronics, Computer and Computation (ICECCO), 2013 International Conference on

Conference_Location

Ankara

Type

conf

DOI

10.1109/ICECCO.2013.6718265

Filename

6718265