DocumentCode :
595038
Title :
Semi-automated OCR database generation for Nabataean scripts
Author :
Ul-Hasan, Adnan ; Bukhari, Syed Saqib ; Rashid, Sheikh Faisal ; Shafait, Faisal ; Breuel, Thomas M.
Author_Institution :
Tech. Univ. of Kaiserslautern, Kaiserslautern, Germany
fYear :
2012
fDate :
11-15 Nov. 2012
Firstpage :
1667
Lastpage :
1670
Abstract :
A large amount of real-world data is required to train and benchmark any character recognition algorithm. Developing a page-level ground-truth database for this purpose is laborious, because it takes considerable manual effort to produce a database that covers all possible words of a language. Generating such a database for historical (degraded) documents or for a cursive script like Urdu is even more complex. The presented work addresses this problem by proposing a semi-automated technique for generating a ground-truth database. The proposed automation is expected to greatly reduce the manual effort needed to develop an OCR database. The basic idea is to apply ligature clustering prior to manual labeling. Two prototype datasets for Urdu script have been developed using the proposed technique, and the results are presented.
Keywords :
database management systems; document image processing; natural language processing; optical character recognition; pattern clustering; Nabataean scripts; Urdu script; character recognition algorithm; cursive script; historical degraded documents; ligature clustering; manual labeling; page-level ground-truth database; real-world data; semiautomated OCR database generation; Accuracy; Clustering algorithms; Databases; Labeling; Manuals; Optical character recognition software; Shape;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Pattern Recognition (ICPR), 2012 21st International Conference on
Conference_Location :
Tsukuba
ISSN :
1051-4651
Print_ISBN :
978-1-4673-2216-4
Type :
conf
Filename :
6460468