DocumentCode :
3489272
Title :
Error Detection in Highly Inflectional Languages
Author :
Sankaran, Naveen ; Jawahar, C.V.
Author_Institution :
Int. Inst. of Inf. Technol., Hyderabad, India
fYear :
2013
fDate :
25-28 Aug. 2013
Firstpage :
1135
Lastpage :
1139
Abstract :
Error detection in OCR output using dictionaries and statistical language models (SLMs) has been common practice for some time in the design of post-processors. Multiple strategies have been used successfully in English to achieve this. However, this success has not yet translated into improved error detection performance for many inflectional languages, especially Indian languages. Challenges such as large unique word lists, a lack of linguistic resources, and a lack of reliable language models are some of the reasons for this. In this paper, we investigate the major challenges in developing error detection techniques for highly inflectional Indian languages. We compare and contrast several attributes of English with those of inflectional languages such as Telugu and Malayalam. We make observations by analyzing statistics computed from popular corpora and relate these observations to the error detection schemes. We propose a method that can detect errors for Telugu and Malayalam, with an F-Score comparable to that obtained for some of the less inflectional languages such as Hindi. Our method learns from the error patterns and SLMs.
Keywords :
error detection; natural language processing; optical character recognition; statistical analysis; English; F-Score; Malayalam; OCR output; SLM; Telugu; dictionaries; error detection performance; error detection techniques; highly inflectional languages; inflectional Indian languages; optical character recognition; statistical language models; Accuracy; Buildings; Dictionaries; Error correction; Hamming distance; Internet; Optical character recognition software; SLM; error detection; Indian languages
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
ISSN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2013.230
Filename :
6628791