• DocumentCode
    478604
  • Title
    Information Extraction by Two Dimensional Parser
  • Author
    Takasu, Atsuhiro
  • Author_Institution
    Nat. Inst. of Inf., Tokyo
  • Volume
    1
  • fYear
    2008
  • fDate
    3-5 Nov. 2008
  • Firstpage
    333
  • Lastpage
    340
  • Abstract
    This paper proposes a learning algorithm for a two-dimensional parser. The parser is designed to analyze the page layout of documents and to extract information using both textual and layout information. The parsing rules are expressed by an extended stochastic context-free grammar that decomposes tokens located in two-dimensional space both horizontally and vertically. In this paper we focus on the learning aspect of the parser and propose a learning algorithm based on the expectation-maximization (EM) technique, in which dynamic programming (DP) is used to make the computation efficient. We apply the proposed algorithm to acquire a stochastic parser for information extraction from scanned document images and show that the learned stochastic grammar extracts bibliographic data with high accuracy.
  • Keywords
    context-free grammars; dynamic programming; expectation-maximisation algorithm; information retrieval; learning (artificial intelligence); expectation maximization technique; information extraction; layout information; learning algorithm; stochastic context free grammar; textual information; two dimensional parser; Couplings; Data mining; Image analysis; Image segmentation; Information analysis; Information retrieval; Natural language processing; Software libraries; Stochastic processes; Text analysis; EM algorithm; layout analysis; stochastic page grammar
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Tools with Artificial Intelligence, 2008. ICTAI '08. 20th IEEE International Conference on
  • Conference_Location
    Dayton, OH
  • ISSN
    1082-3409
  • Print_ISBN
    978-0-7695-3440-4
  • Type
    conf
  • DOI
    10.1109/ICTAI.2008.106
  • Filename
    4669708