Regional Information Center for Science and Technology - Improving state-of-the-art OCR through high-precision document-specific modeling

DocumentCode :

3403429

Title :

Improving state-of-the-art OCR through high-precision document-specific modeling

Author :

Kae, Andrew ; Huang, Gary ; Doersch, Carl ; Learned-Miller, Erik

Author_Institution :

Dept. of Comput. Sci., Univ. of Massachusetts, Amherst, MA, USA

fYear :

2010

fDate :

13-18 June 2010

Firstpage :

1935

Lastpage :

1942

Abstract :

Optical character recognition (OCR) remains a difficult problem for noisy documents or documents not scanned at high resolution. Many current approaches rely on stored font models that are vulnerable to cases in which the document is noisy or is written in a font dissimilar to the stored fonts. We address these problems by learning character models directly from the document itself, rather than using pre-stored font models. This method has had some success in the past, but we are able to achieve substantial improvement in error reduction through a novel method for creating nearly error-free document-specific training data and building character appearance models from this data. In particular, we first use the state-of-the-art OCR system Tesseract to produce an initial translation. Then, our method identifies a subset of words that we have high confidence have been recognized correctly and uses this subset to bootstrap document-specific character models. We present theoretical justification that a word in the selected subset is very unlikely to be incorrectly recognized, and empirical results on a data set of difficult historical newspaper scans demonstrating that we make only two errors in 56 documents. We then relax the theoretical constraint in order to create a larger training set, and using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% overall from the initial Tesseract translation.
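The bootstrapping idea in the abstract (take an initial OCR pass, keep only the words that are very likely correct, learn character models from them, then re-read the remaining text) can be illustrated with a toy Python sketch. This is not the authors' implementation. The lexicon-plus-confidence filter, the mean-template character models, the nearest-template classifier, and every name and data field below are assumptions made for illustration; the record does not say how the paper selects words or models character appearance.

```python
from collections import defaultdict

# Hypothetical input format: each OCR word carries its recognized text, a
# confidence score, and one feature vector per (already segmented) glyph.

def select_confident_words(words, lexicon, min_conf):
    """Keep words found in the lexicon whose confidence meets the threshold.
    (Assumed selection rule, standing in for the paper's high-precision filter.)"""
    return [w for w in words if w["text"] in lexicon and w["conf"] >= min_conf]

def build_char_models(confident_words):
    """Average the glyph feature vectors of each character label into a template."""
    samples = defaultdict(list)
    for w in confident_words:
        for label, feat in zip(w["text"], w["glyphs"]):
            samples[label].append(feat)
    return {
        label: [sum(col) / len(feats) for col in zip(*feats)]
        for label, feats in samples.items()
    }

def classify(feat, models):
    """Return the label whose template is nearest in squared Euclidean distance."""
    return min(
        models,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(feat, models[lab])),
    )

def relabel(word, models):
    """Re-read every glyph of a word with the document-specific models."""
    return "".join(classify(g, models) for g in word["glyphs"])

# Toy document: two confidently read words and one low-confidence misreading.
lexicon = {"cat", "act"}
words = [
    {"text": "cat", "conf": 0.95, "glyphs": [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]},
    {"text": "act", "conf": 0.90, "glyphs": [[1.0, 0.1], [0.1, 0.0], [0.0, 0.9]]},
    {"text": "cot", "conf": 0.40, "glyphs": [[0.0, 0.1], [0.9, 0.0], [0.1, 1.0]]},
]

confident = select_confident_words(words, lexicon, min_conf=0.8)
models = build_char_models(confident)
corrected = relabel(words[2], models)  # re-read the doubtful word
```

In this toy data the third word was read as "cot", but its middle glyph lies closest to the "a" template learned from the two confident words, so the re-read yields "cat". The point of the design is that the templates come from the document being read rather than from stored fonts, which is why the quality of the selected subset matters so much.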

Keywords :

character recognition; document handling; Tesseract translation; character appearance; document specific modeling; font models; noisy documents; optical character recognition; state-of-the-art OCR; Character generation; Character recognition; Computer science; Constraint theory; Error analysis; Error correction; Optical character recognition software; Optical noise; Training data;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on

Conference_Location :

San Francisco, CA

ISSN :

1063-6919

Print_ISBN :

978-1-4244-6984-0

Type :

conf

DOI :

10.1109/CVPR.2010.5539867

Filename :

5539867

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3403429