Language identification and correction in corrupted texts of regional Indian languages

Author

Yadav, Parmatma ; Kaur, Sukhpreet

Author_Institution

Sci. Anal. Group, DRDO, New Delhi, India

fYear

2013

fDate

25-27 Nov. 2013

Firstpage

1

Lastpage

5

Abstract

In the last few years there has been an enormous increase in the volume of text documents in different languages on the Internet, intranets, digital libraries and newsgroups. Automatic identification of the language of these documents is therefore an important problem that has been studied by many researchers. Sometimes these texts are corrupted, for example by OCR errors or transmission noise. In this paper, we present work on language identification in corrupted texts of 11 regional Indian languages along with English. We use: a) an n-gram language model to represent each language, b) a distance-measure-based metric to correct the text, and c) a Bayesian classifier to identify the language of the corrected text. The technique has been tested on texts of different lengths, with different n-gram (3-gram, 4-gram and 5-gram) language models, and with different percentages of corruption.
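
The three stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it rests on several assumptions: character-level n-grams, Hamming distance as the correction metric (suggested only by the keyword list), a correction threshold of one character, add-one smoothing, and a uniform prior over languages. All function names and the toy corpora (one English sentence, one transliterated Hindi sentence) are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(corpora, n=3):
    """Build one n-gram count table per language (assumed model form)."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in corpora.items()}

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def correct_ngram(gram, counts, max_dist=1):
    """Map an unseen n-gram to the nearest known one, if it is close enough.

    Ties in distance are broken in favour of the more frequent n-gram.
    max_dist=1 is an assumed threshold, not a value from the paper.
    """
    if gram in counts:
        return gram
    best = min(counts, key=lambda g: (hamming(gram, g), -counts[g]))
    return best if hamming(gram, best) <= max_dist else gram

def log_likelihood(text, counts, n=3):
    """Add-one smoothed log-likelihood of the corrected text under one model."""
    total = sum(counts.values())
    vocab = len(counts)
    score = 0.0
    for gram in char_ngrams(text, n):
        gram = correct_ngram(gram, counts)
        score += math.log((counts[gram] + 1) / (total + vocab))
    return score

def identify(text, models, n=3):
    """Pick the most probable language; with a uniform prior this is the
    language whose model gives the highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(text, models[lang], n))

# Toy data for illustration only; real models need far larger corpora.
toy_corpora = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "hindi": "yah ek chhota sa vakya hai jo hindi mein likha gaya hai",
}
models = train(toy_corpora, n=3)
print(identify("the quick brxwn fox", models))
```

Passing `n=4` or `n=5` to `train` and `identify` gives the 4-gram and 5-gram variants. The distance threshold keeps a language model from "correcting" arbitrary foreign n-grams into its own vocabulary, which would otherwise inflate its score.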

Keywords

natural language processing; pattern classification; text analysis; 3-gram language model; 4-gram language model; 5-gram language model; Bayesian classifier; English; Internet; Intranet; OCR errors; automatic language identification; corrected text; corrupted texts; digital libraries; distance measure based metric; n-gram language model; news groups; regional Indian languages; text documents; transmission noise errors; Dictionaries; Hamming distance; Measurement; Natural language processing; Optical character recognition software; Probability; Training; Bayesian Classifier & Text Correction; Language Identification; Language Modeling; N-gram Analysis;

fLanguage

English

Publisher

IEEE

Conference_Titel

Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference

Conference_Location

Gurgaon

Type

conf

DOI

10.1109/ICSDA.2013.6709877

Filename

6709877