DocumentCode
1513154
Title
An optimization methodology for document structure extraction on Latin character documents
Author
Liang, Jisheng ; Phillips, Ihsin T. ; Haralick, Robert M.
Author_Institution
Insightful Corp., Seattle, WA, USA
Volume
23
Issue
7
fYear
2001
fDate
7/1/2001 12:00:00 AM
Firstpage
719
Lastpage
734
Abstract
In this paper, we give a formal definition of a document image structure representation, and formulate document image structure extraction as a partitioning problem: finding an optimal solution partitioning the set of glyphs of an input document image into a hierarchical tree structure where entities within the hierarchy at each level have similar physical properties and compatible semantic labels. We present a unified methodology that is applicable to construction of document structures at different hierarchical levels. An iterative, relaxation-like method is used to find a partitioning solution that maximizes the probability of the extracted structure. All the probabilities used in the partitioning process are estimated from an extensive training set of various kinds of measurements among the entities within the hierarchy. The offline probabilities estimated in the training then drive all decisions in the online document structure extraction. We have implemented a text line extraction algorithm using this framework.
Keywords
document image processing; feature extraction; iterative methods; optimisation; probability; tree data structures; Latin character documents; document image structure; feature extraction; hierarchical tree structure; iterative method; optimization; partitioning process; probability; text line extraction; Area measurement; Computer Society; Image databases; Image segmentation; Image sequence analysis; Iterative methods; Optical character recognition software; Optimization methods; Partitioning algorithms; Tree data structures
fLanguage
English
Journal_Title
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher
IEEE
ISSN
0162-8828
Type
jour
DOI
10.1109/34.935846
Filename
935846