DocumentCode :
1477782
Title :
An automatic closed-loop methodology for generating character groundtruth for scanned documents
Author :
Kanungo, Tapas ; Haralick, Robert M.
Author_Institution :
Center for Autom. Res., Maryland Univ., College Park, MD, USA
Volume :
21
Issue :
2
fYear :
1999
fDate :
2/1/1999 12:00:00 AM
Firstpage :
179
Lastpage :
183
Abstract :
Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time consuming, and (iii) the manual labor required for this task is prohibitively expensive. We describe a closed-loop methodology for collecting very accurate groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied and then scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used for creating groundtruth for documents typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for English, Hindi, and FAX document images. The cost of creating groundtruth using our methodology is minimal. If character, word or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document.
Keywords :
document image processing; image matching; image registration; optical character recognition; English; FAX document images; Hindi; OCR systems; automatic closed-loop methodology; character groundtruth; document degradation models; global geometric transformation; registration algorithm; robust local bitmap match; scanned documents; Character generation; Data mining; Degradation; Image analysis; Image generation; Image registration; Optical character recognition software; Sensor fusion; Text analysis; Typesetting;
fLanguage :
English
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher :
IEEE
ISSN :
0162-8828
Type :
jour
DOI :
10.1109/34.748827
Filename :
748827