• DocumentCode
    3625622
  • Title

    Language Identification: How to Distinguish Similar Languages?

  • Author

    Nikola Ljubesic;Nives Mikelic;Damir Boras

  • Author_Institution
    Department of Information Sciences, Faculty of Philosophy, University of Zagreb, Ivana Lučića 3, 10000 Zagreb, Croatia. E-mail: nljubesi@ffzg.hr
  • fYear
    2007
  • fDate
    6/1/2007 12:00:00 AM
  • Firstpage
    541
  • Lastpage
    546
  • Abstract
    This paper discusses the problem of identifying Croatian, a language that even state-of-the-art language identification tools find hard to distinguish from similar languages such as Serbian, Slovenian, or Slovak. We developed a tool that uses a list of the most frequent Croatian words with a threshold that each document must satisfy, and we added a special-character elimination rule, second-order Markov model classification, and a rule of forbidden words. The resulting tool outperforms current tools in discriminating between these similar languages.
  • Keywords
    "Entropy","Vocabulary","Information retrieval","Text mining","Morphology","Frequency","Support vector machines","Kernel","Text categorization","Information technology"
  • Publisher
    IEEE
  • Conference_Titel
    Information Technology Interfaces, 2007. ITI 2007. 29th International Conference on
  • ISSN
    1330-1012
  • Print_ISBN
    953-7138-09-7
  • Type

    conf

  • DOI
    10.1109/ITI.2007.4283829
  • Filename
    4283829
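
The abstract names four components: a frequent-word threshold, a special-character elimination rule, second-order Markov model classification, and a rule of forbidden words. The following is a minimal sketch of how such a pipeline might be combined, not the authors' implementation. It assumes a character-level model with add-one smoothing; the function names, the `threshold` default of 0.2, and the order in which the rules are applied are all hypothetical, and the word lists and character sets would have to be supplied by the user.

```python
import math
from collections import Counter


def train_markov(text):
    """Count character trigrams and their two-character contexts."""
    padded = "  " + text.lower()
    trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    return trigrams, contexts, len(set(padded))


def log_prob(text, model):
    """Log-probability of text under a second-order (trigram) character model."""
    trigrams, contexts, vocab = model
    padded = "  " + text.lower()
    total = 0.0
    for i in range(len(padded) - 2):
        tri = padded[i:i + 3]
        # Add-one smoothing; the extra +1 reserves mass for unseen characters.
        total += math.log((trigrams[tri] + 1) / (contexts[tri[:2]] + vocab + 1))
    return total


def classify(text, models, frequent_words, foreign_chars, forbidden_words,
             threshold=0.2):
    """Return the best-scoring language label, or None if a rule rejects the text.

    All parameters are illustrative assumptions, not values from the paper.
    """
    lowered = text.lower()
    words = lowered.split()
    if not words:
        return None
    # Frequent-word threshold: share of tokens found in the frequent-word list.
    share = sum(w in frequent_words for w in words) / len(words)
    if share < threshold:
        return None
    # Special-character rule: reject text containing characters foreign to the target.
    if set(lowered) & foreign_chars:
        return None
    # Forbidden-word rule: drop every language whose forbidden list matches a token.
    tokens = set(words)
    candidates = [lang for lang in models
                  if not forbidden_words.get(lang, set()) & tokens]
    if not candidates:
        return None
    # Second-order Markov classification over the remaining candidates.
    return max(candidates, key=lambda lang: log_prob(text, models[lang]))
```

In this sketch `models` maps a language label to the result of `train_markov` on a training text for that language, and `forbidden_words` maps each label to words that should not occur in that language. Applying the cheap rules first means the Markov model only has to separate the languages that the rules cannot rule out.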