DocumentCode
2060505
Title
A retargetable table reader
Author
Shamilian, John H. ; Baird, Henry S. ; Wood, Thomas L.
Author_Institution
Lucent Technol. Inc., AT&T Bell Labs., Holmdel, NJ, USA
Volume
1
fYear
1997
fDate
18-20 Aug 1997
Firstpage
158
Abstract
We describe the architecture of a system for reading machine-printed documents in known predefined tabular-data layout styles. In these tables, textual data are presented in record lines made up of fixed-width fields. Tables often do not rely on line-art (ruled lines) to delimit fields, and in this way differ crucially from fixed forms. Our system performs these steps: copes with multiple tables per page; identifies records within tables; segments records into fields; and recognizes characters within fields, constrained by field-specific contextual knowledge. Obstacles to good performance on tables include small print, tight line-spacing, poor-quality text (such as photocopies), and line-art or background patterns that touch the text. Precise skew-correction and pitch-estimation, and high-performance OCR using neural nets, proved crucial in overcoming these obstacles. The most significant technical advances in this work appear to be algorithms for identifying and segmenting records with known layout, and integration of these algorithms with a graphical user interface (GUI) for defining new layouts. This GUI has been ergonomically designed to make efficient and intuitive use of exemplary images, so that the skill and manual effort required to retarget the system to new table layouts are held to a minimum. The system has been applied in this way to more than 400 distinct tabular layouts. During the last three years the system has read over fifty million records with high accuracy.
Keywords
document image processing; image segmentation; neural nets; optical character recognition; background patterns; field-specific contextual knowledge; fixed-width fields; graphical user interface; high-performance OCR; line-art; machine-printed documents; photocopies; pitch-estimation; predefined tabular-data layout; record lines; retargetable table reader; segmentation; skew-correction; small print; textual data; tight line-spacing; Business; Character recognition; Finance; Graphical user interfaces; Image segmentation; Layout; Medical services; Neural networks; Optical character recognition software; Telecommunications
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on
Conference_Location
Ulm
Print_ISBN
0-8186-7898-4
Type
conf
DOI
10.1109/ICDAR.1997.619833
Filename
619833