Title :
Representing OCRed documents in HTML
Author :
Hong, Tao ; Srihari, Sargur N.
Author_Institution :
Microsoft Corp., Redmond, WA, USA
Abstract :
OCR is an error-prone process, and manually proofreading OCR results is time-consuming and expensive. The errors remaining in OCRed texts can cause serious problems in reading and understanding when readers cannot refer to the original image representation. As demonstrated in this paper, a hybrid document that combines symbolic representation and image representation may relieve the problem. If an OCRed document is represented properly in HTML, OCR errors will not have much negative effect on the human reading process in an HTML browser and can be corrected using an HTML authoring tool. An experiment that uses this approach to evaluate a Japanese OCR system developed at CEDAR is also reported in this paper.
Keywords :
authoring systems; document image processing; hypermedia; image representation; optical character recognition; page description languages; HTML authoring tool; HTML browser; Japanese OCR system evaluation; OCR errors; document representation; error correction; human reading process; hybrid document; symbolic representation; text errors; Character recognition; Error correction; Graphical user interfaces; HTML; Humans; Image quality; Image representation; Image segmentation; Optical character recognition software; Text categorization
Conference_Title :
Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997
Conference_Location :
Ulm
Print_ISBN :
0-8186-7898-4
DOI :
10.1109/ICDAR.1997.620628