DocumentCode :
2192633
Title :
Retrieval methods for English-text with missrecognized OCR characters
Author :
Ohta, M. ; Takasu, A. ; Adachi, J.
Author_Institution :
Grad. Sch. of Eng., Tokyo Univ., Japan
Volume :
2
fYear :
1997
fDate :
18-20 Aug 1997
Firstpage :
950
Abstract :
This paper presents three probabilistic text retrieval methods designed to carry out a full-text search of English documents containing OCR errors. Because they search for each query term on the premise that the recognized text contains errors, the methods tolerate such errors, so costly manual post-editing is not required after OCR. In the applied approach, confusion matrices store, for each character, the characters it is likely to be confused with when misrecognized, together with the probability of each confusion. In addition, a 2-gram matrix stores character connection probabilities, i.e., how likely one letter is to follow another. Multiple search terms are generated for an input query term by referring to the confusion matrices, and a full-text search is then run for each search term. The validity of retrieved terms is determined from the error-occurrence and character connection probabilities. The performance of these methods is evaluated experimentally in terms of retrieval effectiveness, i.e., by calculating recall and precision rates. Results indicate marked improvement in comparison with exact matching.
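The query-expansion step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `CONFUSION`, `expand_query`, and `search`, the toy probabilities, and the `threshold` parameter are all assumptions made for illustration, and the 2-gram character connection score is omitted for brevity.

```python
from itertools import product
from math import prod

# Hypothetical confusion matrix (toy values, not from the paper):
# for each true character, the characters OCR may output and their probabilities.
CONFUSION = {
    "l": {"l": 0.90, "1": 0.10},
    "o": {"o": 0.95, "0": 0.05},
}


def expand_query(term, confusion=CONFUSION):
    """Return (variant, probability) pairs for a query term, most likely first."""
    # Characters missing from the matrix are assumed to be recognized correctly.
    options = [list(confusion.get(ch, {ch: 1.0}).items()) for ch in term]
    variants = []
    for combo in product(*options):
        text = "".join(ch for ch, _ in combo)
        probability = prod(p for _, p in combo)
        variants.append((text, probability))
    return sorted(variants, key=lambda v: v[1], reverse=True)


def search(term, document, threshold=0.01):
    """Run a substring search for every sufficiently probable variant."""
    return [
        (variant, p)
        for variant, p in expand_query(term)
        if p >= threshold and variant in document
    ]
```

Under these assumptions, `search("hello", "he1lo world")` finds the misrecognized form `he1lo` that exact matching would miss, and returns it with the probability of that particular misrecognition.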
Keywords :
character recognition; document handling; information retrieval; optical character recognition; English-text; OCR errors; character connection; confusion matrices; full-text search; input query term; missrecognized OCR characters; text retrieval methods; Character recognition; Chromium; Design engineering; Design methodology; Error correction; Image databases; Information retrieval; Information systems; Optical character recognition software; Text recognition;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on
Conference_Location :
Ulm
Print_ISBN :
0-8186-7898-4
Type :
conf
DOI :
10.1109/ICDAR.1997.620651
Filename :
620651