Title :
Benchmarking commercial OCR engines for technical drawings indexing
Author :
Lecoq, J.C. ; Najman, L. ; Gibot, O. ; Trupin, E.
Author_Institution :
PSI, Insa de Rouen, Saint-Etienne, France
fDate :
2001
Abstract :
The choice of a commercial optical character recognition (OCR) engine is important for the process of automatically indexing technical drawings from their title blocks. We aim to benchmark commercial OCR engines with respect to their inclusion in the global digitalisation chain, from scanning to understanding the text information contained in a technical drawing document. The crucial (costly) point is the manual correction of OCR errors. By benchmarking, we intend to identify, for our application domain, the causes of the OCR errors that are the most costly to correct. For a given OCR engine, we model the correction cost as a function of image characteristics. Our methodology thus relies on two elements: on the one hand, the design of the correction cost, which represents the difficulty of correction for a human operator; on the other hand, the classification of the image characteristics that may lead to OCR errors. We analyse the behaviour of this correction cost by principal component analysis (PCA), comparing the engines pairwise to discover their complementarity. This methodology yields a list of domain-dependent problems for OCR engines, ranked by importance with respect to the correction cost. This list can then be used to choose the OCR engine, or to improve OCR execution by focusing on the most important problems. While we are confident that the methodology could easily be applied to other document classes, we apply it to the domain of technical drawings and find that the OCR engines we tested are not adapted to our problem.
Keywords :
database indexing; document image processing; engineering graphics; image classification; optical character recognition; principal component analysis; software performance evaluation; visual databases; OCR engines; benchmarking; database; document scanning; error correction; image characteristics; image classification; optical character recognition engine; principal component analysis; technical drawing indexing; text information; Character recognition; Cost function; Engines; Error correction; Humans; Image recognition; Indexing; Optical character recognition software; Principal component analysis; Technical drawing;
Conference_Title :
Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on
Conference_Location :
Seattle, WA
Print_ISBN :
0-7695-1263-1
DOI :
10.1109/ICDAR.2001.953770