• DocumentCode
    478637
  • Title

    Dolores: An Interactive and Class-Free Approach for Document Logical Restructuring

  • Author

    Bloechle, Jean-Luc ; Pugin, Catherine ; Ingold, Rolf

  • fYear
    2008
  • fDate
    16-19 Sept. 2008
  • Firstpage
    644
  • Lastpage
    652
  • Abstract
    Recovering physical and logical structure from electronic documents is still an open issue. In this paper, we propose a flexible and efficient approach for recovering document structures from PDF files. After a brief introduction to the PDF format and its major features, we report on our evaluation of existing tools and prior work for PDF content extraction and analysis. To overcome the weaknesses of these systems, we propose a new analysis strategy based on an intermediate representation, called XCDF, which represents physical structures in a canonical way. The paper then describes the PDF reverse engineering workflow and focuses on document logical restructuring. Finally, the paper concludes with potential future improvements.
  • Keywords
    Costs; Data mining; Feature extraction; Image databases; Postal services; Robustness; Sorting; Spatial databases; Transportation; Visual databases; document restructuring; logical structure; pdf reengineering; physical structure
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems, 2008. DAS '08. The Eighth IAPR International Workshop on
  • Conference_Location
    Nara, Japan
  • Print_ISBN
    978-0-7695-3337-7
  • Type

    conf

  • DOI
    10.1109/DAS.2008.44
  • Filename
    4670017