• DocumentCode
    2971764
  • Title

    An effective method to recognize the language of a text in a collection of multilingual documents

  • Author

    Kadri, Said ; Moussaoui, Abdelouahab

  • Author_Institution
    Dept. of ICST, Univ. of M'sila, M'sila, Algeria
  • fYear
    2013
  • fDate
    7-9 Nov. 2013
  • Firstpage
    208
  • Lastpage
    211
  • Abstract
    Identifying the language of a text means assigning the text to the language in which it is written. This identification has become important because of the growing diversity of textual data in different languages on the web. Reliable recognition of a text's language is not possible if the word is taken as the only basic unit of information: this may work for some languages but is very difficult for others. Segmenting the text into characteristic n-grams is a very efficient alternative in this field. It has also become a preferred tool in language acquisition and in the extraction of knowledge from texts. In this paper, we present the best-known identification methods and propose a new method based on character n-grams. We also compare the results obtained with those of other methods under two approaches: segmentation into words and segmentation into n-grams.
  • Keywords
    Internet; character recognition; data mining; natural language processing; text analysis; World Wide Web; characteristic n-grams; knowledge extraction; language acquisition; multilingual documents; text language; textual data; Distance measurement; Educational institutions; Pragmatics; Probability; Text categorization; Text recognition; Training; N-grams; language identification; machine learning; text categorization; text mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electronics, Computer and Computation (ICECCO), 2013 International Conference on
  • Conference_Location
    Ankara
  • Type

    conf

  • DOI
    10.1109/ICECCO.2013.6718265
  • Filename
    6718265