Title :
Column segmentation by white space pattern matching
Author_Institution :
Fuji Xerox Palo Alto Lab., CA, USA
Abstract :
Model-based column segmentation is described. Sequences of horizontal white space across a column are used as the basic features. Structures of columns in a specific publication are described by two levels of regular expressions: column expressions (CEs) and element expressions (EEs). A CE represents patterns of element sequences. An EE represents patterns of white space sequences for each element type. Additional spatial constraints on element attributes can also be described. Segmentation is performed in three steps: element candidate extraction using the EEs, column structure verification using the CE, and ranking by comparison with statistical data. Experiments were performed on columns from two different scientific journals. More than 70% of the columns were correctly segmented as the top choice, and for more than 87% the correct segmentation was among the top three choices. When spatial constraints were applied to element attributes, the rate was more than 90%.
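The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything specific in it is an assumption made for the example: the gap symbols `s`, `m`, and `l`, the pixel thresholds, the two element types `H` (heading) and `P` (paragraph), the sample CE, and the expected line counts that stand in for the statistical data.

```python
import re

# Assumption: each horizontal white gap that follows a text line is
# quantized by height (pixels) into 's' (small), 'm' (medium), 'l' (large).
def encode_gaps(gap_heights, small=6, large=14):
    return "".join(
        "s" if h < small else "m" if h < large else "l" for h in gap_heights
    )

# Hypothetical element expressions (EE): one regex over gap symbols per type.
ELEMENT_EXPRESSIONS = {
    "H": re.compile(r"m"),        # heading: one line followed by a medium gap
    "P": re.compile(r"s*[ml]"),   # paragraph: tight lines, then a wider gap
}

# Hypothetical column expression (CE): a regex over element labels.
COLUMN_EXPRESSION = re.compile(r"H?P+(HP+)*")

# Hypothetical statistics: typical number of lines per element type.
EXPECTED_LINES = {"H": 1, "P": 3}

def candidates(gaps, start=0):
    """Step 1: yield every split of gaps[start:] into elements matching an EE."""
    if start == len(gaps):
        yield []
        return
    for label, ee in ELEMENT_EXPRESSIONS.items():
        for end in range(start + 1, len(gaps) + 1):
            if ee.fullmatch(gaps, start, end):
                for rest in candidates(gaps, end):
                    yield [(label, start, end)] + rest

def segment(gap_heights):
    """Return (cost, labels, elements) tuples, best (lowest cost) first."""
    gaps = encode_gaps(gap_heights)
    ranked = []
    for seg in candidates(gaps):
        labels = "".join(label for label, _, _ in seg)
        # Step 2: keep only element sequences accepted by the CE.
        if COLUMN_EXPRESSION.fullmatch(labels):
            # Step 3: score by deviation from the expected line counts.
            cost = sum(abs((e - s) - EXPECTED_LINES[l]) for l, s, e in seg)
            ranked.append((cost, labels, seg))
    ranked.sort(key=lambda r: r[0])
    return ranked

result = segment([10, 3, 3, 10, 3, 20])
```

With these assumed settings the gap heights encode to `mssmsl`. Two candidate segmentations pass the CE, and the heading-first reading `HPP` ranks above `PPP` because its element lengths are closer to the assumed statistics.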
Keywords :
document image processing; feature extraction; image segmentation; optical character recognition; pattern matching; statistical analysis; column expressions; column structure verification; element attributes; element candidate extraction; element expressions; experiments; horizontal white space sequences; model-based column segmentation; ranking; scientific journals; spatial constraints; statistical data; white space pattern matching; Data mining; Image segmentation; Laboratories; Pattern matching; Robustness; Tagging; White spaces
Conference_Title :
Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on
Conference_Location :
Montreal, Que.
Print_ISBN :
0-8186-7128-9
DOI :
10.1109/ICDAR.1995.598960