• DocumentCode
    3695295
  • Title
    OCR for bilingual documents using language modeling
  • Author
    Anupama Ray;Sai Rajeswar;Santanu Chaudhury
  • Author_Institution
    Department of Electrical Engineering, Indian Institute of Technology Delhi, India
  • fYear
    2015
  • Firstpage
    1256
  • Lastpage
    1260
  • Abstract
    Script-based features are highly discriminative for text segmentation and recognition, so they are widely used in Optical Character Recognition (OCR). However, relying on script-dependent features prevents such architectures from being adapted directly to another script. Script-independent systems solve this problem to a certain extent for monolingual documents, but the problem worsens for multilingual documents because it is very difficult for a single classifier to learn many scripts. Generally, a script identification module identifies text segments and the script-dependent classifier is selected accordingly. This paper presents a unified framework that combines a language model with multiple preprocessing hypotheses for word recognition from bilingual document images. Prior to text recognition, preprocessing steps such as binarization and segmentation are required for ease of recognition, but these steps introduce errors that compound and propagate to the final recognition accuracy. In this paper we use multiple preprocessing routines as alternate hypotheses and use a language model to verify each alternative and choose the best recognized sequence. We test this architecture for word recognition on Kannada-English and Telugu-English bilingual documents and achieve better recognition rates than single preprocessing methods using the same classifier.
  • Keywords
    "Chlorine","Optical character recognition software","Adaptation models","Lead","Speech"
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2015 13th International Conference on
  • Type
    conf
  • DOI
    10.1109/ICDAR.2015.7333965
  • Filename
    7333965
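
The selection step described in the abstract (several preprocessing routines produce alternate recognized sequences, and a language model picks the most plausible one) can be sketched roughly as follows. This is only an illustration under stated assumptions: the record does not say which language model or scoring rule the authors use, so the character-bigram model with add-one smoothing, the length normalization, and all names (`train_bigram_lm`, `log_prob`, `pick_best`, the routine labels) are hypothetical stand-ins. ASCII words stand in for Kannada, Telugu, or English text.

```python
import math
from collections import Counter


def train_bigram_lm(words):
    """Count character bigrams over a word list, with ^ and $ as word boundaries.

    Assumption: a character bigram model is used here only for illustration.
    """
    bigrams, unigrams = Counter(), Counter()
    for w in words:
        chars = ["^"] + list(w) + ["$"]
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    vocab = {c for w in words for c in w} | {"^", "$"}
    return bigrams, unigrams, len(vocab)


def log_prob(word, lm):
    """Length-normalized log-probability of a word under add-one smoothing."""
    bigrams, unigrams, vocab_size = lm
    chars = ["^"] + list(word) + ["$"]
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return total / (len(chars) - 1)


def pick_best(hypotheses, lm):
    """Return the (routine, text) pair whose recognized text the model scores highest.

    `hypotheses` maps a preprocessing routine label to the classifier's output
    for the word image preprocessed by that routine.
    """
    return max(hypotheses.items(), key=lambda item: log_prob(item[1], lm))


# Toy corpus and toy classifier outputs (hypothetical data, not from the paper).
lm = train_bigram_lm(["recognition", "recognise", "script", "document"])
hypotheses = {
    "binarization_a": "rec0gniti0n",
    "binarization_b": "recognition",
    "binarization_c": "reeognition",
}
best_routine, best_text = pick_best(hypotheses, lm)
```

In this toy setup the same classifier is assumed to have been run once per preprocessing routine; only the final verification step is modeled, and a real system would train the language model on a much larger corpus for each language.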