Title :
Language identification and correction in corrupted texts of regional Indian languages
Author :
Yadav, Parmatma ; Kaur, Sukhpreet
Author_Institution :
Sci. Anal. Group, DRDO, New Delhi, India
Abstract :
In the last few years, there has been an enormous increase in the volume of text documents in different languages on the internet, intranets, digital libraries and news groups. Automatic identification of the language of these documents is therefore an important problem that many researchers have studied. Sometimes these texts are corrupted, for example by OCR errors or transmission noise. In this paper, we present work on language identification in corrupted texts of 11 regional Indian languages along with English. We use: a) an n-gram language model to represent the language, b) a distance-measure-based metric to correct the text, and c) a Bayesian classifier to identify the language of the corrected text. The technique has been tested on texts of different lengths, with different n-gram (3-gram, 4-gram and 5-gram) language models, and with different percentages of corrupted text.
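The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the correction metric or the classifier details. The sketch assumes character n-grams, assumes Hamming distance (which appears in the keywords) applied to each unseen n-gram, and assumes a naive Bayes classifier with uniform priors and add-one smoothing. All function names, the `max_dist` parameter and the toy corpora are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Return the list of overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpora, n=3):
    """Build one n-gram count table per language from {language: text}."""
    return {lang: Counter(ngrams(text, n)) for lang, text in corpora.items()}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def correct(gram, known, max_dist=1):
    """Map an unseen n-gram to its nearest known n-gram (assumed rule).

    The n-gram is left unchanged if nothing lies within max_dist.
    Ties are broken alphabetically so the result is deterministic.
    """
    if gram in known or not known:
        return gram
    best = min(known, key=lambda k: (hamming(gram, k), k))
    return best if hamming(gram, best) <= max_dist else gram


def log_likelihood(grams, counts, vocab_size):
    """Naive Bayes log-likelihood with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)


def identify(text, models, n=3):
    """Correct the text against each model, then pick the best-scoring language."""
    vocab = set().union(*models.values())
    scores = {}
    for lang, counts in models.items():
        grams = [correct(g, counts) for g in ngrams(text, n)]
        scores[lang] = log_likelihood(grams, counts, len(vocab))
    return max(scores, key=scores.get)


# Toy training data, far too small for real use (illustration only).
CORPORA = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "hindi": "यह एक छोटा सा उदाहरण वाक्य है और यह हिंदी में लिखा गया है",
}
models = train(CORPORA)

# "quxck" simulates a single-character OCR or transmission error.
print(identify("the quxck brown fox", models))
```

Correcting each n-gram against one language model at a time is only one possible design; correcting against a shared dictionary before classification would be an equally plausible reading of the abstract.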
Keywords :
natural language processing; pattern classification; text analysis; 3-gram language model; 4-gram language model; 5-gram language model; Bayesian classifier; English; Internet; Intranet; OCR errors; automatic language identification; corrected text; corrupted texts; digital libraries; distance measure based metric; n-gram language model; news groups; regional Indian languages; text documents; transmission noise errors; Dictionaries; Hamming distance; Measurement; Natural language processing; Optical character recognition software; Probability; Training; Bayesian Classifier & Text Correction; Language Identification; Language Modeling; N-gram Analysis;
Conference_Title :
Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference
Conference_Location :
Gurgaon
DOI :
10.1109/ICSDA.2013.6709877