Title :
N-gram based algorithm for distinguishing between Hindi and Sanskrit texts
Author :
Sreejith, C. ; Indu, M. ; Raj, P. C. Reghu
Author_Institution :
Dept. of Comput. Sci. & Eng., Gov. Eng. Coll., Palakkad, India
Abstract :
Language Identification (LI) is the process of determining the natural language in which given content is written. It is an important preprocessing step in many Natural Language Processing (NLP) tasks. In a multilingual society like India, automatic language identification has wide scope, since it would be a vital step in bridging the digital divide between the Indian masses and others. In this paper, we present an N-gram based method of language identification for documents written in Hindi and Sanskrit, which share the same script, and report the results. The technique can also be applied to other pairs of Indian languages that share a common script.
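The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way to do N-gram language identification, not the authors' method. It assumes character trigrams, add-one smoothing over the training vocabulary, and a log-likelihood comparison. The names `char_ngrams`, `train_profiles`, `identify`, `SAMPLES`, and `PROFILES` are hypothetical, and the two training sentences are toy illustrations rather than data from the paper.

```python
import math
import unicodedata
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text`."""
    # Normalize so that visually identical Devanagari strings compare equal.
    text = unicodedata.normalize("NFC", text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_profiles(samples, n=3):
    """Map each language label to a Counter of its n-gram frequencies."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def identify(text, profiles, n=3):
    """Pick the language whose profile gives the highest smoothed log-likelihood."""
    vocab = set().union(*profiles.values())  # all n-grams seen in training
    grams = char_ngrams(text, n)
    scores = {}
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # Add-one smoothing keeps unseen n-grams from producing log(0).
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + len(vocab))) for g in grams
        )
    return max(scores, key=scores.get)


# Toy training text (assumption: illustrative sentences, not the paper's corpus).
SAMPLES = {
    "hindi": "यह एक किताब है और मैं इसे पढ़ रहा हूँ",
    "sanskrit": "अहं पुस्तकं पठामि सः गृहं गच्छति",
}
PROFILES = train_profiles(SAMPLES)

print(identify("मैं पढ़ रहा हूँ", PROFILES))
print(identify("सः गृहं गच्छति", PROFILES))
```

With realistic corpora the same structure applies. Because Hindi and Sanskrit share the Devanagari script, the script alone cannot separate them, so the classifier has to rely on differences in character sequences such as typical word endings.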
Keywords :
Digital Divide; natural language processing; text analysis; Hindi texts; Indian languages; N-gram based algorithm; N-gram based method; NLP; Sanskrit texts; automatic language identification; multilingual society; preprocessing step; Educational institutions; Pragmatics; Predictive models; Testing; Training; Hindi; Language identification; N-gram; Sanskrit
Conference_Title :
Computing, Communications and Networking Technologies (ICCCNT), 2013 Fourth International Conference on
Conference_Location :
Tiruchengode
Print_ISBN :
978-1-4799-3925-1
DOI :
10.1109/ICCCNT.2013.6726777