DocumentCode :
3485870
Title :
Creating an Improved Version Using Noisy OCR from Multiple Editions
Author :
Wemhoener, David ; Yalniz, Ismet Zeki ; Manmatha, R.
Author_Institution :
Sch. of Comput. Sci., Univ. of Massachusetts, Amherst, MA, USA
fYear :
2013
fDate :
25-28 Aug. 2013
Firstpage :
160
Lastpage :
164
Abstract :
This paper evaluates an automated scheme for aligning and combining optical character recognition (OCR) output from three scans of a book to generate a composite version with fewer OCR errors. While there has been some previous work on aligning multiple OCR versions of the same scan, the scheme introduced in this paper does not require that the scans be from the same copy of the book, or even the same edition. The three OCR outputs are combined using an algorithm that builds upon a technique for aligning two sequences at a time. In the algorithm, a multiple sequence alignment of the scans is generated by stitching together pairwise alignments and is used in turn to construct a corrected text. The approach is able to correct OCR errors so long as they do not occur in multiple scans. The proposed approach is shown to be effective even if some of the books contain additional content such as introductions or commentary. This scheme is used to generate improved versions from OCR texts taken from the Internet Archive. The accuracy of the original scans and of the composite text is evaluated by comparing them to the version available from Project Gutenberg.
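The idea of stitching pairwise alignments into a multi-way alignment and then voting can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes word-level tokens, uses Python's standard `difflib.SequenceMatcher` as a stand-in pairwise aligner, and treats the first text as a pivot against which the others are aligned. The function names `align_to_pivot` and `combine` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_to_pivot(pivot, other):
    """Map each pivot token index to the token of `other` aligned with it, or None."""
    aligned = [None] * len(pivot)
    matcher = SequenceMatcher(a=pivot, b=other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Use only blocks that pair tokens one-to-one: exact matches and
        # same-length substitutions. Insertions and deletions stay unaligned.
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for offset in range(i2 - i1):
                aligned[i1 + offset] = other[j1 + offset]
    return aligned


def combine(pivot_text, *other_texts):
    """Stitch pairwise alignments through the pivot and vote per token."""
    pivot = pivot_text.split()
    # One column per scan; the pivot is trivially aligned with itself.
    columns = [pivot] + [align_to_pivot(pivot, t.split()) for t in other_texts]
    result = []
    for votes in zip(*columns):
        counts = Counter(v for v in votes if v is not None)
        best, best_count = counts.most_common(1)[0]
        # Replace the pivot token only when another reading strictly outvotes it.
        result.append(best if best_count > counts[votes[0]] else votes[0])
    return " ".join(result)


if __name__ == "__main__":
    # Made-up example scans: each has one OCR error, and the third has extra front matter.
    scan_a = "the qnick brown fox jumps over the lazy dog"
    scan_b = "the quick brown fox jumps ovcr the lazy dog"
    scan_c = "Preface the quick brown f0x jumps over the lazy dog"
    print(combine(scan_a, scan_b, scan_c))
```

In this sketch an error is corrected only when it appears in a single scan, which mirrors the condition stated in the abstract. Material absent from the pivot, such as the extra word in the third scan, is simply ignored; a fuller treatment would also need to handle text missing from the pivot itself.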
Keywords :
digital libraries; image sequences; optical character recognition; Internet Archive; OCR error correction; OCR outputs; OCR texts; automatic optical character recognition alignment scheme; automatic optical character recognition combination scheme; book commentary content; book introductions; book scan accuracy evaluation; composite text accuracy evaluation; multiple sequence scan alignment; noisy OCR; pairwise alignments; Accuracy; Context; Educational institutions; Error analysis; Error correction; Internet; Optical character recognition software; OCR error correction; scanned book collections; sequence alignment;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
ISSN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2013.39
Filename :
6628604