• DocumentCode
    2148572
  • Title
    Automatic Content Extraction on Semi-structured Documents
  • Author
    Santos, José Eduardo Bastos dos

  • Author_Institution
    Perceptive Software, Shawnee, KS, USA
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    1235
  • Lastpage
    1239
  • Abstract
    Extracting specific content from certain types of documents can be very challenging, especially when the solution is not tailored to a particular document type and refrains from using explicit contextual information. In this paper, we address the problem of automatically extracting data from semi-structured documents through an unsupervised process based on an analysis of the document's own morphological composition. We also discuss how this approach can be applied to different types of documents, with special attention paid to college transcripts. The success of our method is supported by extensive tests, from which we have drawn some authentic examples.
  • Keywords
    content management; document handling; authentic example; automatic content extraction; automatic data extraction; college transcripts; contextual information; morphological composition; semistructured document; unsupervised process; Accuracy; Conferences; Educational institutions; Feature extraction; Layout; Text analysis; automatic zoning; data extraction; document image understanding; geometric and logical layout analysis; invoices; page decomposition
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.249
  • Filename
    6065507