DocumentCode
1368991
Title
A Fuzzy Logic Approach to Wrapping PDF Documents
Author
Flesca, Sergio ; Masciari, Elio ; Tagarelli, Andrea
Author_Institution
Dept. of Electron., Comput. & Syst. Sci., Univ. of Calabria, Rende, Italy
Volume
23
Issue
12
fYear
2011
Firstpage
1826
Lastpage
1841
Abstract
The PDF format represents the de facto standard for print-oriented documents. In this paper, we address the problem of wrapping PDF documents, which raises new challenges in several contexts of text data management. Our proposal is based on a novel bottom-up hierarchical wrapping approach that exploits fuzzy logic to handle the “uncertainty” which is intrinsic to the structure and presentation of PDF documents. A PDF wrapper is defined by specifying a set of group type definitions that impose a target structure on groups of tokens containing the required information. Constraints on token groupings are formulated as fuzzy conditions, which are defined on spatial and content predicates of tokens. We define a formal semantics for PDF wrappers and propose an algorithm for wrapper evaluation that runs in polynomial time with respect to the size of a PDF document. The proposed approach has been implemented in a wrapper generation system that offers visual capabilities to assist the designer in specifying and evaluating a PDF wrapper. Experimental results have shown good accuracy and applicability of our system to PDF documents of various domains.
Keywords
electronic publishing; formal specification; fuzzy logic; printing; programming language semantics; storage management; text analysis; PDF documents wrapping; bottom-up hierarchical wrapping approach; de facto standard; formal semantics; fuzzy logic approach; polynomial time; print-oriented documents; text data management; token groupings; wrapper generation system; Data mining; Fuzzy logic; Information extraction; Information retrieval; Portable document format; Uncertainty; Visualization; Adobe PDF; Information extraction; PDFWrap system; fuzzy logic; print-oriented documents; wrapping
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2010.220
Filename
5620910