Title :
User-Guided Wrapping of PDF Documents Using Graph Matching Techniques
Author_Institution :
Database & Artificial Intelligence Group, Vienna University of Technology, Vienna, Austria
Abstract :
Several established products on the market support wrapping, the semi-automatic navigation of Web pages and extraction of data from them. These solutions use the inherent structure of HTML to locate instances of the data to be wrapped. Because PDF documents have no such structure, wrapping them has long been recognized as a challenging problem. We have developed a novel system for wrapping PDF documents, which is currently at the prototype stage. A PDF document is represented as an attributed relational graph, in which nodes represent physical items on the page and edges represent spatial and logical relationships. A wrapper is defined as a subgraph of the document with additional conditions, and a non-expert can create one quickly and intuitively using the GUI. An algorithm based on subgraph isomorphism then finds the data instances and extracts the required data. Experiments show that our approach achieves good results with good execution time.
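The matching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or data model: the node attributes, the relation labels (`left_of`, `above`), the function name `find_matches` and the sample data are all assumptions made for illustration. The wrapper is a small pattern graph whose nodes carry predicate conditions, and a plain backtracking search looks for edge-preserving (non-induced) embeddings of it in the document graph.

```python
# Hypothetical sketch: a naive backtracking subgraph matcher, not the paper's method.
from typing import Any, Callable, Dict, Iterator, Tuple

Node = str
Attrs = Dict[str, Any]
Edges = Dict[Tuple[Node, Node], str]  # (source, target) -> relation label


def find_matches(
    doc_nodes: Dict[Node, Attrs],
    doc_edges: Edges,
    wrap_nodes: Dict[Node, Callable[[Attrs], bool]],
    wrap_edges: Edges,
) -> Iterator[Dict[Node, Node]]:
    """Yield each injective mapping of wrapper nodes to document nodes that
    satisfies the node conditions and preserves every labelled wrapper edge."""
    order = list(wrap_nodes)

    def consistent(mapping: Dict[Node, Node]) -> bool:
        # Check only wrapper edges whose two endpoints are already mapped.
        for (a, b), label in wrap_edges.items():
            if a in mapping and b in mapping:
                if doc_edges.get((mapping[a], mapping[b])) != label:
                    return False
        return True

    def extend(i: int, mapping: Dict[Node, Node]) -> Iterator[Dict[Node, Node]]:
        if i == len(order):
            yield dict(mapping)
            return
        w = order[i]
        for d, attrs in doc_nodes.items():
            # Skip document nodes already used or failing the wrapper condition.
            if d in mapping.values() or not wrap_nodes[w](attrs):
                continue
            mapping[w] = d
            if consistent(mapping):
                yield from extend(i + 1, mapping)
            del mapping[w]

    yield from extend(0, {})


# Assumed toy page: text items as nodes, spatial relations as labelled edges.
doc_nodes = {
    "t1": {"text": "Price:"},
    "t2": {"text": "42.00"},
    "t3": {"text": "Total:"},
    "t4": {"text": "99.50"},
    "t5": {"text": "Invoice"},
}
doc_edges = {
    ("t1", "t2"): "left_of",
    ("t3", "t4"): "left_of",
    ("t5", "t1"): "above",
    ("t1", "t3"): "above",
}

# Assumed wrapper: a label ending in ":" directly left of a numeric value.
wrap_nodes = {
    "label": lambda a: str(a["text"]).endswith(":"),
    "value": lambda a: str(a["text"]).replace(".", "", 1).isdigit(),
}
wrap_edges = {("label", "value"): "left_of"}

matches = list(find_matches(doc_nodes, doc_edges, wrap_nodes, wrap_edges))
extracted = [doc_nodes[m["value"]]["text"] for m in matches]
```

Each mapping in `matches` corresponds to one data instance, and `extracted` collects the text bound to the wrapper's `value` node. A practical system would need pruning or indexing, since this exhaustive search grows quickly with the size of the page graph.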
Keywords :
Internet; data structures; graph theory; graphical user interfaces; hypermedia markup languages; information retrieval; pattern matching; GUI; HTML; PDF document; Web page; attributed relational graph matching technique; data extraction; data instance; semiautomatic navigation; subgraph isomorphism; user-guided wrapping; Data mining; Databases; Information analysis; Navigation; Prototypes; Text analysis; Web pages; Wrapping; PDF; document analysis; document understanding; graph matching
Conference_Title :
Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
Conference_Location :
Barcelona
Print_ISBN :
978-1-4244-4500-4
ISSN :
1520-5363
DOI :
10.1109/ICDAR.2009.238