• DocumentCode
    3695098
  • Title

    Extracting structured data from unstructured document with incomplete resources

  • Author

    Hervé Déjean

  • Author_Institution
    Xerox Research Centre Europe, Meylan, France
  • fYear
    2015
  • Firstpage
    271
  • Lastpage
    275
  • Abstract
    We present a method for extracting structured elements of information, called structured data (sdata), from OCR'ed pages. The method first analyzes the layout of the page, building several concurrent layout structures. Then a tagging step is performed in order to tag textual elements based on their content. Combining the layout structures and the tagged elements, layout models for representing the structured data are inferred for the current page. These models are used to correct or tag some elements missed by the tagging step. The final set of structured data is extracted. An evaluation is presented.
  • Keywords
    "Layout","Electronic mail","Tagging","Uniform resource locators","Accuracy"
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2015 13th International Conference on
  • Type

    conf

  • DOI
    10.1109/ICDAR.2015.7333766
  • Filename
    7333766