DocumentCode
594717
Title
Document recognition and translation system for unconstrained Arabic documents
Author
Huaigu Cao ; Jinying Chen ; Devlin, John ; Prasad, Ranga ; Natarajan, Prem
Author_Institution
Raytheon BBN Technol., Cambridge, MA, USA
fYear
2012
fDate
11-15 Nov. 2012
Firstpage
318
Lastpage
321
Abstract
We describe an end-to-end system for translating real-world Arabic field documents that contain a mix of handwritten and printed content into English. These documents are extremely challenging to recognize due to the presence of noise, poor image capture quality, and variations in writing style, writing device, font, layout, genre, etc. Furthermore, no off-the-shelf machine translation (MT) engine is available to translate these documents into English. We present key innovations in document preprocessing, text line segmentation, and text recognition for dealing with these challenges. In addition, we describe our approach for adapting MT using a limited amount of in-domain training data, which results in significant improvements in translation accuracy.
Keywords
document image processing; handwritten character recognition; image segmentation; language translation; natural language processing; text detection; Arabic-to-English translation; document preprocessing; document recognition system; document translation system; end-to-end system; font variations; genre variations; handwritten content; image capture quality; in-domain training data; layout variations; machine translation; noisy image; printed content; real-world Arabic document translation; text line segmentation; text recognition; unconstrained Arabic documents; writing device; writing style variations; Adaptation models; Hidden Markov models; Noise; Optical character recognition software; Text analysis; Text recognition; Training;
fLanguage
English
Publisher
IEEE
Conference_Titel
Pattern Recognition (ICPR), 2012 21st International Conference on
Conference_Location
Tsukuba
ISSN
1051-4651
Print_ISBN
978-1-4673-2216-4
Type
conf
Filename
6460136
Link To Document