• DocumentCode
    3020925
  • Title
    Document understanding system using stochastic context-free grammars
  • Author
    Handley, John C. ; Namboodiri, Anoop M. ; Zanibbi, Richard
  • Author_Institution
    Xerox Corp., Webster, NY, USA
  • fYear
    2005
  • fDate
    29 Aug.-1 Sept. 2005
  • Firstpage
    511
  • Abstract
    We present a document understanding system in which the arrangement of lines of text and block separators within a document is modeled by stochastic context-free grammars. A grammar corresponds to a document genre; our system may be adapted to a new genre simply by replacing the input grammar. The system incorporates an optical character recognition system that outputs characters, their positions, and font sizes. These features are combined to form a document representation of lines of text and separators. Lines of text are labeled as tokens using regular expression matching. The maximum likelihood parse of this stream of tokens and separators yields a functional labeling of the document lines. We describe business card and business letter applications.
  • Keywords
    context-free grammars; document image processing; feature extraction; maximum likelihood estimation; optical character recognition; stochastic processes; business card; business letter applications; document understanding system; functional labeling; maximum likelihood parse; optical character recognition system; regular expression matching; stochastic context-free grammars; Business; Context modeling; Data mining; Humans; Information technology; Optical character recognition software; Particle separators; Production; Stochastic processes; Stochastic systems;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
  • ISSN
    1520-5263
  • Print_ISBN
    0-7695-2420-6
  • Type
    conf
  • DOI
    10.1109/ICDAR.2005.93
  • Filename
    1575598