  • DocumentCode
    1513154
  • Title
    An optimization methodology for document structure extraction on Latin character documents
  • Author
    Liang, Jisheng; Phillips, Ihsin T.; Haralick, Robert M.
  • Author_Institution
    Insightful Corp., Seattle, WA, USA
  • Volume
    23
  • Issue
    7
  • fYear
    2001
  • fDate
    7/1/2001
  • Firstpage
    719
  • Lastpage
    734
  • Abstract
    In this paper, we give a formal definition of a document image structure representation and formulate document image structure extraction as a partitioning problem: finding an optimal partition of the set of glyphs of an input document image into a hierarchical tree structure in which the entities at each level have similar physical properties and compatible semantic labels. We present a unified methodology that is applicable to the construction of document structures at different hierarchical levels. An iterative, relaxation-like method is used to find a partitioning solution that maximizes the probability of the extracted structure. All the probabilities used in the partitioning process are estimated from an extensive training set of various kinds of measurements among the entities within the hierarchy. The probabilities estimated offline during training then drive all decisions in the online document structure extraction. We have implemented a text line extraction algorithm using this framework.
  • Keywords
    document image processing; feature extraction; iterative methods; optimisation; probability; tree data structures; Latin character documents; document image structure; hierarchical tree structure; iterative method; optimization; partitioning process; text line extraction; Area measurement; Computer Society; Image databases; Image segmentation; Image sequence analysis; Optical character recognition software; Optimization methods; Partitioning algorithms
  • fLanguage
    English
  • Journal_Title
    Pattern Analysis and Machine Intelligence, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0162-8828
  • Type
    jour
  • DOI
    10.1109/34.935846
  • Filename
    935846