DocumentCode :
478604
Title :
Information Extraction by Two Dimensional Parser
Author :
Takasu, Atsuhiro
Author_Institution :
Nat. Inst. of Inf., Tokyo
Volume :
1
fYear :
2008
fDate :
3-5 Nov. 2008
Firstpage :
333
Lastpage :
340
Abstract :
This paper proposes a learning algorithm for a two-dimensional parser. The parser is designed to analyze the page layout of documents and extract information using both textual and layout information. The parsing rules are expressed by an extended stochastic context-free grammar that decomposes tokens located in two-dimensional space both horizontally and vertically. In this paper, we focus on the learning aspect of the parser and propose a learning algorithm based on the expectation-maximization technique, in which dynamic programming (DP) is used for efficient processing. We apply the proposed algorithm to acquire a stochastic parser for information extraction from scanned document images and show that the learned stochastic grammar extracts bibliographic data with high accuracy.
Keywords :
context-free grammars; dynamic programming; expectation-maximisation algorithm; information retrieval; learning (artificial intelligence); dynamic programming; expectation maximization technique; information extraction; layout information; learning algorithm; stochastic context free grammar; textual information; two dimensional parser; Couplings; Data mining; Image analysis; Image segmentation; Information analysis; Information retrieval; Natural language processing; Software libraries; Stochastic processes; Text analysis; EM algorithm; layout analysis; stochastic page grammar;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Tools with Artificial Intelligence, 2008. ICTAI '08. 20th IEEE International Conference on
Conference_Location :
Dayton, OH
ISSN :
1082-3409
Print_ISBN :
978-0-7695-3440-4
Type :
conf
DOI :
10.1109/ICTAI.2008.106
Filename :
4669708