Title :
Synthetic data for Arabic OCR system development
Author :
Märgner, V. ; Pechwitz, M.
Author_Institution :
Inst. for Commun. Technol., Technische Univ. Braunschweig, Germany
fDate :
2001
Abstract :
A system is presented for the automatic generation of synthetic databases for the development or evaluation of Arabic word or text recognition systems (Arabic OCR). The proposed system works without any scanning of printed paper. First, Arabic text is typeset using a standard typesetting system. Second, a noise-free bitmap of the document and the corresponding ground truth (GT) are generated automatically. Finally, an image distortion can be superimposed on the character or word image to simulate the real-world noise expected in the intended application. All necessary modules are presented together with some examples. Special problems caused by specific features of Arabic, such as printing from right to left, many diacritical points, variation in the height of characters, and changes in position relative to the writing line, are discussed. The synthetic data set was used to train and test a recognition system for Arabic printed words; the system is based on a hidden Markov model (HMM) and was originally developed for German cursive script. Recognition results with different synthetic data sets are presented.
Keywords :
computer graphics; hidden Markov models; optical character recognition; visual databases; Arabic OCR system development; HMM; character height; diacritical points; hidden Markov model; image distortion; noise; noise-free bitmap; synthetic data; synthetic database generation; text recognition systems; word recognition systems; Character generation; Communications technology; Databases; Hidden Markov models; Noise generators; Optical character recognition software; System testing; Text recognition; Typesetting; Writing;
Conference_Titel :
Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on
Conference_Location :
Seattle, WA
Print_ISBN :
0-7695-1263-1
DOI :
10.1109/ICDAR.2001.953967