DocumentCode :
1513154
Title :
An optimization methodology for document structure extraction on Latin character documents
Author :
Liang, Jisheng ; Phillips, Ihsin T. ; Haralick, Robert M.
Author_Institution :
Insightful Corp., Seattle, WA, USA
Volume :
23
Issue :
7
fYear :
2001
fDate :
7/1/2001 12:00:00 AM
Firstpage :
719
Lastpage :
734
Abstract :
In this paper, we give a formal definition of a document image structure representation, and formulate document image structure extraction as a partitioning problem: finding an optimal solution partitioning the set of glyphs of an input document image into a hierarchical tree structure where entities within the hierarchy at each level have similar physical properties and compatible semantic labels. We present a unified methodology that is applicable to construction of document structures at different hierarchical levels. An iterative, relaxation-like method is used to find a partitioning solution that maximizes the probability of the extracted structure. All the probabilities used in the partitioning process are estimated from an extensive training set of various kinds of measurements among the entities within the hierarchy. The offline probabilities estimated in the training then drive all decisions in the online document structure extraction. We have implemented a text line extraction algorithm using this framework.
Keywords :
document image processing; feature extraction; iterative methods; optimisation; probability; tree data structures; Latin character documents; document image structure; feature extraction; hierarchical tree structure; iterative method; optimization; partitioning process; probability; text line extraction; Area measurement; Computer Society; Image databases; Image segmentation; Image sequence analysis; Iterative methods; Optical character recognition software; Optimization methods; Partitioning algorithms; Tree data structures
fLanguage :
English
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher :
IEEE
ISSN :
0162-8828
Type :
jour
DOI :
10.1109/34.935846
Filename :
935846