DocumentCode :
1202067
Title :
Selecting a restoration technique to minimize OCR error
Author :
Cannon, Mike ; Fugate, Mike ; Hush, Don R. ; Scovel, Clint
Author_Institution :
Comput. Res. & Applications Group, Los Alamos Nat. Lab., NM, USA
Volume :
14
Issue :
3
fYear :
2003
fDate :
5/1/2003
Firstpage :
478
Lastpage :
490
Abstract :
This paper introduces a learning problem related to the task of converting printed documents to ASCII text files. The goal of the learning procedure is to produce a function that maps documents to restoration techniques in such a way that, on average, the restored documents have minimum optical character recognition error. We derive a general form for the optimal function and use it to motivate the development of a nonparametric method based on nearest neighbors. We also develop a direct method of solution based on empirical error minimization, for which we prove a finite sample bound on estimation error that is independent of distribution. We show that this empirical error minimization problem is an extension of the empirical optimization problem for traditional M-class classification with general loss function and prove computational hardness for this problem. We then derive a simple iterative algorithm called generalized multiclass ratchet (GMR) and prove that it produces an optimal function asymptotically (with probability 1). To obtain the GMR algorithm, we introduce a new data map that extends Kesler's construction for the multiclass problem and then apply an algorithm called Ratchet to this mapped data, where Ratchet is a modification of the Pocket algorithm. Finally, we apply these methods to a collection of documents and report on the experimental results.
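The nearest-neighbor idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each document is summarized by a numeric feature vector, that the OCR error of every restoration technique has been measured on each training document, and that the rule picks the technique with the lowest average error among the k nearest training documents. The function name `select_technique`, the feature vectors, and the error values are all hypothetical.

```python
import math


def select_technique(x, train_features, train_errors, k=3):
    """Return the index of the restoration technique with the lowest
    average OCR error among the k training documents nearest to x.

    Hypothetical helper for illustration only.
    """
    # Rank training documents by Euclidean distance to the query features.
    order = sorted(range(len(train_features)),
                   key=lambda i: math.dist(x, train_features[i]))
    neighbors = order[:k]
    n_techniques = len(train_errors[0])
    # Average each technique's observed OCR error over the neighbors.
    mean_error = [
        sum(train_errors[i][j] for i in neighbors) / len(neighbors)
        for j in range(n_techniques)
    ]
    # Choose the technique whose estimated error is smallest.
    return min(range(n_techniques), key=lambda j: mean_error[j])


# Made-up toy data: four documents, two features, two techniques.
train_features = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.1)]
train_errors = [[0.10, 0.30], [0.12, 0.25], [0.40, 0.05], [0.35, 0.08]]

choice_a = select_technique((0.05, 0.0), train_features, train_errors, k=2)
choice_b = select_technique((1.0, 1.0), train_features, train_errors, k=2)
```

In this toy setup, documents near the first cluster are routed to technique 0 and documents near the second cluster to technique 1, because those techniques gave the lowest measured error on the neighboring training documents.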
Keywords :
document image processing; image restoration; iterative methods; optical character recognition; ASCII text files; GMR; OCR error minimization; computational hardness; finite sample bound; generalized multiclass ratchet; iterative algorithm; minimum optical character recognition error; nearest neighbors method; nonparametric method; printed documents; restoration technique; Character recognition; Error analysis; Estimation error; Iterative algorithms; Minimization methods; Nearest neighbor searches; Optical character recognition software; Optical distortion; Optical noise; Pipelines;
fLanguage :
English
Journal_Title :
Neural Networks, IEEE Transactions on
Publisher :
IEEE
ISSN :
1045-9227
Type :
jour
DOI :
10.1109/TNN.2003.811711
Filename :
1199647