Regional Information Center for Science and Technology - Error Detection in Highly Inflectional Languages

DocumentCode :

3489272

Title :

Error Detection in Highly Inflectional Languages

Author :

Sankaran, Naveen ; Jawahar, C.V.

Author_Institution :

Int. Inst. of Inf. Technol., Hyderabad, India

fYear :

2013

fDate :

25-28 Aug. 2013

Firstpage :

1135

Lastpage :

1139

Abstract :

Error detection in OCR output using dictionaries and statistical language models (SLMs) has been common practice for some time in the design of post-processors. Multiple strategies have been used successfully for English. However, this success has not yet translated into improved error detection performance for many inflectional languages, especially Indian languages. Challenges such as a large unique-word list, a lack of linguistic resources, and a lack of reliable language models are among the reasons for this. In this paper, we investigate the major challenges in developing error detection techniques for highly inflectional Indian languages. We compare and contrast several attributes of English with those of inflectional languages such as Telugu and Malayalam. We make observations by analyzing statistics computed from popular corpora and relate these observations to the error detection schemes. We propose a method that can detect errors for Telugu and Malayalam with an F-Score comparable to that of some less inflectional languages such as Hindi. Our method learns from error patterns and SLMs.

Keywords :

error detection; natural language processing; optical character recognition; statistical analysis; English; F-Score; Malayalam; OCR output; SLM; Telugu; dictionaries; error detection performance; error detection techniques; highly inflectional languages; inflectional Indian languages; statistical language models; Accuracy; Buildings; Dictionaries; Error correction; Hamming distance; Internet; Optical character recognition software; Indian languages

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition (ICDAR), 2013 12th International Conference on

Conference_Location :

Washington, DC

ISSN :

1520-5363

Type :

conf

DOI :

10.1109/ICDAR.2013.230

Filename :

6628791

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3489272