• DocumentCode
    670225
  • Title
    Using bag-of-words to distinguish similar languages: How efficient are they?
  • Author
    Zampieri, Marcos
  • Author_Institution
    Saarland Univ., Saarbrücken, Germany
  • fYear
    2013
  • fDate
    19-21 Nov. 2013
  • Firstpage
    37
  • Lastpage
    41
  • Abstract
    This paper presents a number of experiments describing the use of machine learning algorithms and bag-of-words for the task of automatic language identification. The paper focuses on the identification of language varieties, which is a known weakness of general-purpose language identification methods. This question has been addressed by a number of studies in recent years, most of them relying on character n-gram language models. In this paper, I experiment with simple bag-of-words and compare the results with previously proposed n-gram-based approaches. Three algorithms were used to perform these classification experiments: Multinomial Naive Bayes (MNB), Support Vector Machines (SVM), and the J48 classifier.
  • Keywords
    Bayes methods; learning (artificial intelligence); natural language processing; pattern classification; support vector machines; text analysis; J48 classifier; MNB; SVM; automatic language identification; bag-of-words; character n-gram language models; classification experiments; general purpose language identification methods; language variety identification; machine learning algorithms; multinomial naive Bayes; similar languages; text classification; Accuracy; Computational modeling; Europe; Machine learning algorithms; Markov processes; Smoothing methods; Support vector machines
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence and Informatics (CINTI), 2013 IEEE 14th International Symposium on
  • Conference_Location
    Budapest
  • Print_ISBN
    978-1-4799-0194-4
  • Type
    conf
  • DOI
    10.1109/CINTI.2013.6705230
  • Filename
    6705230