DocumentCode
478637
Title
Dolores: An Interactive and Class-Free Approach for Document Logical Restructuring
Author
Bloechle, Jean-Luc ; Pugin, Catherine ; Ingold, Rolf
fYear
2008
fDate
16-19 Sept. 2008
Firstpage
644
Lastpage
652
Abstract
Recovering the physical and logical structure of electronic documents is still an open issue. In this paper, we propose a flexible and efficient approach for recovering document structures from PDF files. After a brief introduction to the PDF format and its major features, we report on our evaluation of existing tools and work on PDF content extraction and analysis. To overcome the weaknesses of these systems, we propose a new analysis strategy based on an intermediate representation, called XCDF, which represents physical structures in a canonical way. The paper then describes the PDF reverse engineering workflow, focusing on document logical restructuring, and concludes with potential future improvements.
Keywords
Costs; Data mining; Feature extraction; Image databases; Postal services; Robustness; Sorting; Spatial databases; Transportation; Visual databases; document restructuring; logical structure; PDF reengineering; physical structure
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis Systems, 2008. DAS '08. The Eighth IAPR International Workshop on
Conference_Location
Nara, Japan
Print_ISBN
978-0-7695-3337-7
Type
conf
DOI
10.1109/DAS.2008.44
Filename
4670017