DocumentCode
478604
Title
Information Extraction by Two Dimensional Parser
Author
Takasu, Atsuhiro
Author_Institution
Nat. Inst. of Inf., Tokyo
Volume
1
fYear
2008
fDate
3-5 Nov. 2008
Firstpage
333
Lastpage
340
Abstract
This paper proposes a learning algorithm for a two-dimensional parser. The parser is designed to analyze the page layout of documents and extract information using both textual and layout information. The parsing rules are expressed by an extended stochastic context-free grammar that decomposes tokens located in two-dimensional space both horizontally and vertically. In this paper, we focus on the learning aspect of the parser and propose a learning algorithm based on the expectation-maximization technique, in which dynamic programming (DP) is used for efficient processing. We apply the proposed algorithm to acquire a stochastic parser for information extraction from scanned document images and show that the learned stochastic grammar extracts bibliographic data with high accuracy.
Keywords
context-free grammars; dynamic programming; expectation-maximisation algorithm; information retrieval; learning (artificial intelligence); dynamic programming; expectation maximization technique; information extraction; layout information; learning algorithm; stochastic context free grammar; textual information; two dimensional parser; Couplings; Data mining; Image analysis; Image segmentation; Information analysis; Information retrieval; Natural language processing; Software libraries; Stochastic processes; Text analysis; EM algorithm; layout analysis; stochastic page grammar;
fLanguage
English
Publisher
IEEE
Conference_Titel
Tools with Artificial Intelligence, 2008. ICTAI '08. 20th IEEE International Conference on
Conference_Location
Dayton, OH
ISSN
1082-3409
Print_ISBN
978-0-7695-3440-4
Type
conf
DOI
10.1109/ICTAI.2008.106
Filename
4669708