DocumentCode
2148572
Title
Automatic Content Extraction on Semi-structured Documents
Author
Santos, José Eduardo Bastos dos
Author_Institution
Perceptive Software, Shawnee, KS, USA
fYear
2011
fDate
18-21 Sept. 2011
Firstpage
1235
Lastpage
1239
Abstract
Extracting specific content from certain types of documents can be very challenging, especially when the solution is not narrowly tailored and refrains from using explicit contextual information. In this paper, we address the problem of automatically extracting data from semi-structured documents through an unsupervised process based on an analysis of the document's own morphological composition. We also discuss how this approach can be applied to different types of documents, with special attention paid to college transcripts. The success of our method is supported by extensive tests, from which we have drawn some authentic examples.
Keywords
content management; document handling; authentic example; automatic content extraction; automatic data extraction; college transcripts; contextual information; morphological composition; semistructured document; unsupervised process; Accuracy; Conferences; Educational institutions; Feature extraction; Layout; Text analysis; automatic zoning; data extraction; document image understanding; geometric and logical layout analysis; invoices; page decomposition
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location
Beijing
ISSN
1520-5363
Print_ISBN
978-1-4577-1350-7
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2011.249
Filename
6065507