Title :
Document image summarization without OCR
Author :
Bloomberg, Dan S. ; Chen, Francine R.
Author_Institution :
Xerox Palo Alto Res. Center, CA, USA
Abstract :
A system for selecting excerpts directly from imaged text without performing optical character recognition is described. The images are segmented to find text regions, text lines and words, and sentence and paragraph boundaries are identified. A set of word equivalence classes is computed based on the rank blur hit-miss transform. This information is used to identify stop words and keywords. Sentences for presentation as part of a summary are then selected based on keywords and on the location of the sentences.
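The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the image-analysis stages have already assigned each word image an integer equivalence-class ID, and it treats the most frequent classes as stop words and the next most frequent as keywords. The abstract does not state the actual criteria, so the function names, the parameters (num_stop, num_keywords, lead_bonus) and the scoring rule are all hypothetical.

```python
from collections import Counter

def classify_classes(sentences, num_stop, num_keywords):
    """Split word equivalence classes into stop words and keywords.

    ASSUMPTION: the most frequent classes are stop words and the next
    most frequent are keywords; the real system may use other cues.
    """
    counts = Counter(cid for _, _, words in sentences for cid in words)
    ranked = [cid for cid, _ in counts.most_common()]
    stop = set(ranked[:num_stop])
    keywords = set(ranked[num_stop:num_stop + num_keywords])
    return stop, keywords

def select_sentences(sentences, num_stop=1, num_keywords=2,
                     num_selected=2, lead_bonus=1.0):
    """Return indices of selected sentences, in reading order.

    Each sentence is (paragraph_index, position_in_paragraph, class_ids).
    Score = number of keyword occurrences, plus a hypothetical bonus
    for sentences that open a paragraph (the location cue).
    """
    _, keywords = classify_classes(sentences, num_stop, num_keywords)
    scored = []
    for idx, (_, pos, words) in enumerate(sentences):
        score = sum(1 for cid in words if cid in keywords)
        if pos == 0:
            score += lead_bonus
        scored.append((score, idx))
    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:num_selected]
    return sorted(idx for _, idx in top)

# Toy input: two paragraphs of two sentences; integers are class IDs.
SENTENCES = [
    (0, 0, [1, 2, 3, 1]),
    (0, 1, [1, 2, 4]),
    (1, 0, [1, 3, 5, 3]),
    (1, 1, [1, 2, 3, 6]),
]

if __name__ == "__main__":
    print(select_sentences(SENTENCES))
```

No character identities are needed at any point: the sketch only counts how often each word shape class recurs, which is what makes summarization without OCR possible in principle.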
Keywords :
document image processing; image segmentation; transforms; document image summarization; imaged text; keywords; paragraph boundaries; rank blur hit-miss transform; sentence; stop word identification; text lines; text regions; word equivalence classes; words; Character generation; Character recognition; Data mining; Graphics; Image analysis; Image processing; Image segmentation; Natural languages; Optical character recognition software; Shape
Conference_Title :
Image Processing, 1996. Proceedings., International Conference on
Conference_Location :
Lausanne
Print_ISBN :
0-7803-3259-8
DOI :
10.1109/ICIP.1996.560744