OCR for bilingual documents using language modeling

Author

Anupama Ray;Sai Rajeswar;Santanu Chaudhury

Author_Institution

Department of Electrical Engineering, Indian Institute of Technology Delhi, India

fYear

2015

Firstpage

1256

Lastpage

1260

Abstract

Script-based features are highly discriminative for text segmentation and recognition, so they are widely used in Optical Character Recognition (OCR). However, relying on script-dependent features prevents such architectures from being adapted directly to another script. Script-independent systems solve this problem to a certain extent for monolingual documents, but the problem is aggravated for multilingual documents, because it is very difficult for a single classifier to learn many scripts. Generally, a script identification module identifies text segments, and the script-dependent classifier is selected accordingly. This paper presents a unified framework that combines a language model with multiple preprocessing hypotheses for word recognition from bilingual document images. Prior to text recognition, preprocessing steps such as binarization and segmentation are required for ease of recognition, but these steps induce a large combinatorial error that propagates to the final recognition accuracy. In this paper, we use multiple preprocessing routines as alternate hypotheses and use a language model to verify each alternative and choose the best recognized sequence. We test this architecture on word recognition for Kannada-English and Telugu-English bilingual documents and achieve better recognition rates than single methods using the same classifier.
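
The selection step described in the abstract (several preprocessing routines each yield a candidate word, and a language model picks the most plausible one) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say which language model or preprocessing routines the authors use, so a smoothed character-bigram model stands in for the language model, and the routine names (`otsu`, `sauvola`, `adaptive`), the toy corpus, and the candidate strings are hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(corpus_words):
    """Count character bigrams, using ^ and $ as word-boundary symbols."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for word in corpus_words:
        chars = ["^"] + list(word) + ["$"]
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)


def log_prob(word, lm):
    """Length-normalised log-probability with add-one smoothing."""
    bigrams, contexts, vocab_size = lm
    chars = ["^"] + list(word) + ["$"]
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        numerator = bigrams[(prev, cur)] + 1
        denominator = contexts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total / (len(chars) - 1)


def choose_best(hypotheses, lm):
    """Return the (routine, text) pair whose text the model scores highest."""
    return max(hypotheses.items(), key=lambda item: log_prob(item[1], lm))


# Hypothetical toy data: a tiny training vocabulary and one recognised
# string per (assumed) preprocessing routine for the same word image.
corpus = ["recognition", "document", "language", "model", "script"]
lm = train_bigram_lm(corpus)
hypotheses = {
    "otsu": "rec0gniti0n",
    "sauvola": "recognition",
    "adaptive": "recoqnition",
}
best_routine, best_text = choose_best(hypotheses, lm)
print(best_routine, best_text)
```

Normalising the score by length keeps candidates of different lengths comparable, which matters when different segmentations split or merge characters. A real system would presumably train the model on a large corpus for each language rather than on a handful of words.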

Keywords

"Chlorine","Optical character recognition software","Adaptation models","Lead","Speech"

Publisher

IEEE

Conference_Titel

Document Analysis and Recognition (ICDAR), 2015 13th International Conference on

Type

conf

DOI

10.1109/ICDAR.2015.7333965

Filename

7333965