Title :
How to extend and bootstrap an existing data set with real-life degraded images
Author :
Phillips, Ihsin Tsaiyun
Author_Institution :
Dept. of Comput. Sci./Software Eng., Seattle Univ., WA, USA
Abstract :
This paper introduces a methodology for bootstrapping and creating a large number of groundtruthed “real-life” degraded images from an existing data set at a fraction of the original cost and time. The real-life degradations include geometric distortions, coffee stains, water or ink marks, and folds and creases. The methodology includes an automatic procedure to generate an unlimited number of “real-life” degraded images (with coffee and ink marks and soil spots) without any cost. A small experiment was conducted to illustrate the effectiveness of the methodology. In the experiment, 22 real-life degraded images and the two original images were tested on a commercial OCR system. The OCR accuracy rates for the two original pages are 98.46% and 99.34%, while the accuracy rates for the degraded pages range from 57.17% to 98.45%, depending on the severity and type of degradation applied to the pages.
Keywords :
document image processing; image enhancement; optical character recognition; automatic procedure; bootstrapping; coffee stains; commercial OCR system; creases; data set; document image understanding; folds; geometric distortions; ink marks; real-life degradations; real-life degraded images; soil spots; Costs; Degradation; Image databases; Image generation; Image recognition; Ink; Noise generators; Optical character recognition software; Spatial databases; System testing;
Conference_Title :
Document Analysis and Recognition, 1999. ICDAR '99. Proceedings of the Fifth International Conference on
Conference_Location :
Bangalore
Print_ISBN :
0-7695-0318-7
DOI :
10.1109/ICDAR.1999.791881