Title :
Error Detection in Highly Inflectional Languages
Author :
Sankaran, Naveen ; Jawahar, C.V.
Author_Institution :
Int. Inst. of Inf. Technol., Hyderabad, India
Abstract :
Error detection in OCR output using dictionaries and statistical language models (SLMs) has been common practice in post-processor design for some time. Multiple strategies have been used successfully for English. However, this success has not yet carried over to error detection performance in many inflectional languages, especially Indian languages. The reasons include large unique-word lists, a lack of linguistic resources, and a lack of reliable language models. In this paper, we investigate the major challenges in developing error detection techniques for highly inflectional Indian languages. We compare and contrast several attributes of English with those of inflectional languages such as Telugu and Malayalam. We make observations by analyzing statistics computed from popular corpora and relate these observations to the error detection schemes. We propose a method that detects errors in Telugu and Malayalam with an F-Score comparable to that of some less inflectional languages, such as Hindi. Our method learns from error patterns and SLMs.
Keywords :
error detection; natural language processing; optical character recognition; statistical analysis; English; F-Score; Malayalam; OCR output; SLM; Telugu; dictionaries; error detection performance; error detection techniques; highly inflectional languages; inflectional Indian languages; statistical language models; Accuracy; Buildings; Error correction; Hamming distance; Internet; Optical character recognition software; Indian languages;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.230