• DocumentCode
    594717
  • Title
    Document recognition and translation system for unconstrained Arabic documents
  • Author
    Huaigu Cao ; Jinying Chen ; John Devlin ; Ranga Prasad ; Prem Natarajan
  • Author_Institution
    Raytheon BBN Technologies, Cambridge, MA, USA
  • fYear
    2012
  • fDate
    11-15 Nov. 2012
  • Firstpage
    318
  • Lastpage
    321
  • Abstract
    We describe an end-to-end system for translating real-world Arabic field documents that contain a mix of handwritten and printed content into English. These documents are extremely challenging to recognize due to the presence of noise, poor image capture quality, and variations in writing style, writing device, font, layout, genre, etc. Furthermore, no off-the-shelf machine translation (MT) engine is available to translate these documents into English. We present key innovations that address these challenges in document preprocessing, text line segmentation, and text recognition. In addition, we describe our approach for adapting MT using a limited amount of in-domain training data, which results in significant improvements in translation accuracy.
  • Keywords
    document image processing; handwritten character recognition; image segmentation; language translation; natural language processing; text detection; Arabic-to-English translation; document preprocessing; document recognition system; document translation system; end-to-end system; font variations; genre variations; handwritten content; image capture quality; in-domain training data; layout variations; machine translation; noisy image; printed content; real-world Arabic document translation; text line segmentation; text recognition; unconstrained Arabic documents; writing device; writing style variations; Adaptation models; Hidden Markov models; Noise; Optical character recognition software; Text analysis; Text recognition; Training;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition (ICPR), 2012 21st International Conference on
  • Conference_Location
    Tsukuba
  • ISSN
    1051-4651
  • Print_ISBN
    978-1-4673-2216-4
  • Type
    conf
  • Filename
    6460136