• DocumentCode
    174895
  • Title
    Robust Language Identification of Noisy Texts: Proposal of Hybrid Approaches
  • Author
    Abainia, K. ; Ouamour, S. ; Sayoud, H.
  • Author_Institution
    USTHB Univ., Algiers, Algeria
  • fYear
    2014
  • fDate
    1-5 Sept. 2014
  • Firstpage
    228
  • Lastpage
    232
  • Abstract
    This paper addresses automatic language identification of noisy texts, an important task in natural language processing. Several existing works in this field rely on statistical and machine learning approaches for different categories of texts. Unfortunately, most of the proposed methods work well on clean and/or long texts but often fail when the text is corrupted or too short. In this work, we use a typical dataset of short texts collected from several discussion forums and containing several types of noise. The dataset covers 32 languages, some of which are quite different from one another while others are very close. We propose two types of methods to identify the language of a text: a term-based method and a character-based method. We also propose two hybrid methods to enhance the performance of those techniques. Experiments show that the proposed hybrid methods achieve good language identification performance on noisy texts.
  • Keywords
    natural language processing; text analysis; automatic language identification; character-based method; noisy texts; term-based method; Conferences; Databases; Expert systems; Automatic Language Identification; Hybrid Approach; Natural Language Processing; Noisy Text; Text categorization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database and Expert Systems Applications (DEXA), 2014 25th International Workshop on
  • Conference_Location
    Munich
  • ISSN
    1529-4188
  • Print_ISBN
    978-1-4799-5721-7
  • Type
    conf
  • DOI
    10.1109/DEXA.2014.55
  • Filename
    6974854