Regional Information Center for Science and Technology (RICEST) - N-gram based algorithm for distinguishing between Hindi and Sanskrit texts

DocumentCode :

3127135

Title :

N-gram based algorithm for distinguishing between Hindi and Sanskrit texts

Author :

Sreejith, C. ; Indu, M. ; Raj, P. C. Reghu

Author_Institution :

Dept. of Comput. Sci. & Eng., Gov. Eng. Coll., Palakkad, India

fYear :

2013

fDate :

4-6 July 2013

Firstpage :

Lastpage :

Abstract :

Language Identification (LI) is the process of determining the natural language in which a given piece of content is written. It is an important preprocessing step in many Natural Language Processing (NLP) tasks. In a multilingual society like India, automatic language identification has especially wide scope, since it is a vital step in bridging the digital divide between the Indian masses and others. In this paper, we present an N-gram based method of language identification for documents written in Hindi and Sanskrit, which share the same script, and we report the results. The technique can also be applied to other pairs of Indian languages that share a common script.
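
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common N-gram approach (rank-order character n-gram profiles compared with an "out-of-place" distance), not the authors' method. The function names, the `top_k` cutoff, and the two one-sentence training samples are all assumptions made for illustration; a usable system would need profiles built from large Hindi and Sanskrit corpora.

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Yield character n-grams (by code point) from whitespace-normalized text."""
    padded = " " + " ".join(text.split()) + " "
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

def build_profile(text, top_k=300):
    """Map the top_k most frequent n-grams to their frequency rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text))
    # Break frequency ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )

def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = build_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))

# Hypothetical toy training data: one short sentence per language.
# These are placeholders, not data from the paper.
SAMPLES = {
    "hindi": "यह एक किताब है और वह एक कलम है",
    "sanskrit": "एतत् पुस्तकम् अस्ति सः बालकः पठति",
}
PROFILES = {lang: build_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(identify("यह एक कलम है", PROFILES))
```

Because both languages use the same script, character-set checks alone cannot separate them; the sketch relies instead on how often particular character sequences occur. With samples this small the ranking is fragile, so the result for unseen text should not be trusted until the profiles are trained on realistic data.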

Keywords :

Digital divide; natural language processing; text analysis; Hindi texts; Indian languages; N-gram based algorithm; N-gram based method; NLP; Sanskrit texts; automatic language identification; multilingual society; preprocessing step; Educational institutions; Pragmatics; Predictive models; Testing; Training; Hindi; Language identification; N-gram; Sanskrit

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computing, Communications and Networking Technologies (ICCCNT), 2013 Fourth International Conference on

Conference_Location :

Tiruchengode

Print_ISBN :

978-1-4799-3925-1

Type :

conf

DOI :

10.1109/ICCCNT.2013.6726777

Filename :

6726777

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3127135