DocumentCode :
1585360
Title :
Synthetic data for Arabic OCR system development
Author :
Märgner, V. ; Pechwitz, M.
Author_Institution :
Inst. for Commun. Technol., Technische Univ. Braunschweig, Germany
fYear :
2001
fDate :
2001
Firstpage :
1159
Lastpage :
1163
Abstract :
A system for the automatic generation of synthetic databases for the development or evaluation of Arabic word or text recognition systems (Arabic OCR) is presented. The proposed system works without any scanning of printed paper. First, Arabic text is typeset using a standard typesetting system. Second, a noise-free bitmap of the document and the corresponding ground truth (GT) are automatically generated. Finally, an image distortion can be superimposed on the character or word image to simulate the expected real-world noise of the intended application. All necessary modules are presented together with some examples. Special problems caused by specific features of Arabic, such as printing from right to left, many diacritical points, variation in the height of characters, and changes in the relative position to the writing line, are discussed. The synthetic data set was used to train and test a recognition system for Arabic printed words, based on a hidden Markov model (HMM) and originally developed for German cursive script. Recognition results with different synthetic data sets are presented.
Keywords :
computer graphics; hidden Markov models; optical character recognition; visual databases; Arabic OCR system development; HMM; character height; diacritical points; hidden Markov model; image distortion; noise; noise-free bitmap; synthetic data; synthetic database generation; text recognition systems; word recognition systems; Character generation; Communications technology; Databases; Hidden Markov models; Noise generators; Optical character recognition software; System testing; Text recognition; Typesetting; Writing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on
Conference_Location :
Seattle, WA
Print_ISBN :
0-7695-1263-1
Type :
conf
DOI :
10.1109/ICDAR.2001.953967
Filename :
953967