Regional Information Center for Science and Technology - Semi-automated OCR database generation for Nabataean scripts

DocumentCode :

595038

Title :

Semi-automated OCR database generation for Nabataean scripts

Author :

Ul-Hasan, Adnan ; Bukhari, Syed Saqib ; Rashid, Sheikh Faisal ; Shafait, Faisal ; Breuel, Thomas M.

Author_Institution :

Tech. Univ. of Kaiserslautern, Kaiserslautern, Germany

fYear :

2012

fDate :

11-15 Nov. 2012

Firstpage :

1667

Lastpage :

1670

Abstract :

A large amount of real-world data is required to train and benchmark any character recognition algorithm. Developing a page-level ground-truth database for this purpose is laborious, as it takes considerable manual effort to produce a database that covers all possible words of a language. Generating such a database for historical (degraded) documents or for a cursive script like Urdu is even more complex. The presented work addresses this problem by proposing a semi-automated technique for generating a ground-truth database. The proposed automation is expected to greatly reduce the manual effort of developing an OCR database. The basic idea is to apply ligature clustering prior to manual labeling. Two prototype datasets for Urdu script have been developed using the proposed technique, and the results are also presented.

Keywords :

database management systems; document image processing; natural language processing; optical character recognition; pattern clustering; Nabataean scripts; Urdu script; character recognition algorithm; cursive script; historical degraded documents; ligature clustering; manual labeling; page-level ground-truth database; real-world data; semiautomated OCR database generation; Accuracy; Clustering algorithms; Databases; Labeling; Manuals; Optical character recognition software; Shape;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Pattern Recognition (ICPR), 2012 21st International Conference on

Conference_Location :

Tsukuba

ISSN :

1051-4651

Print_ISBN :

978-1-4673-2216-4

Type :

conf

Filename :

6460468

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=595038