Title :
Using bag-of-words to distinguish similar languages: How efficient are they?
Author :
Zampieri, Marcos
Author_Institution :
Saarland Univ., Saarbrücken, Germany
Abstract :
This paper presents a number of experiments applying machine learning algorithms and bag-of-words features to the task of automatic language identification. The paper focuses on the identification of language varieties, which is a known weakness of general-purpose language identification methods. This question has been addressed by a number of studies in recent years, most of them relying on character n-gram language models. In this paper, I experiment with simple bag-of-words features and compare the results with previously proposed n-gram-based approaches. Three algorithms were used in these classification experiments: Multinomial Naive Bayes (MNB), Support Vector Machines (SVM) and the J48 classifier.
Keywords :
Bayes methods; learning (artificial intelligence); natural language processing; pattern classification; support vector machines; text analysis; J48 classifier; MNB; SVM; automatic language identification; bag-of-words; character n-gram language models; classification experiments; general purpose language identification methods; language variety identification; machine learning algorithms; multinomial naive Bayes; similar languages; text classification; Accuracy; Computational modeling; Europe; Markov processes; Smoothing methods
Conference_Title :
Computational Intelligence and Informatics (CINTI), 2013 IEEE 14th International Symposium on
Conference_Location :
Budapest
Print_ISBN :
978-1-4799-0194-4
DOI :
10.1109/CINTI.2013.6705230
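
Illustrative sketch :
The abstract describes classifying texts by language variety from a bag-of-words representation, with Multinomial Naive Bayes as one of the three algorithms. The following is a minimal sketch of that general technique, not the paper's implementation: the function names (tokenize, train_mnb, predict), the whitespace tokenizer, the add-one smoothing, and the four toy training sentences are all assumptions made for illustration, and none of them come from the paper or its data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return text.lower().split()


def train_mnb(docs, alpha=1.0):
    """Fit a multinomial Naive Bayes model on (text, label) pairs.

    alpha is the additive smoothing constant (add-one by default, an assumption).
    """
    word_counts = defaultdict(Counter)  # label -> word frequencies (the bag of words)
    doc_counts = Counter()              # label -> number of training documents
    vocab = set()
    for text, label in docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {
        "word_counts": dict(word_counts),
        "doc_counts": doc_counts,
        "vocab": vocab,
        "alpha": alpha,
    }


def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        # Log prior: share of training documents carrying this label.
        score = math.log(model["doc_counts"][label] / total_docs)
        total_tokens = sum(counts.values())
        for token in tokenize(text):
            if token not in model["vocab"]:
                continue  # words never seen in training are ignored
            # Smoothed log likelihood of the word under this label.
            score += math.log((counts[token] + alpha) / (total_tokens + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data invented for this sketch: two varieties of English.
TRAIN = [
    ("the colour of the theatre programme", "en-GB"),
    ("my favourite neighbour", "en-GB"),
    ("the color of the theater program", "en-US"),
    ("my favorite neighbor", "en-US"),
]

model = train_mnb(TRAIN)
print(predict(model, "a favourite colour"))
print(predict(model, "favorite color"))
```

Because closely related varieties share most of their vocabulary, a word-level model like this one can only separate them through the few words that differ (spelling variants in the toy data above). Character n-gram models, the approach the abstract compares against, would instead count overlapping character sequences.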