DocumentCode :
3485760
Title :
Figure Metadata Extraction from Digital Documents
Author :
Choudhury, Sagnik Ray ; Mitra, Pinaki ; Kirk, A. ; Szep, Silvia ; Pellegrino, Donald ; Jones, Simon ; Giles, C. Lee
Author_Institution :
Inf. Sci. & Technol., Pennsylvania State Univ., University Park, PA, USA
fYear :
2013
fDate :
25-28 Aug. 2013
Firstpage :
135
Lastpage :
139
Abstract :
Academic papers contain multiple figures (information graphics) representing important findings and experimental results. Automatic data extraction from such figures and classification of information graphics is not straightforward and is a well-studied problem in document analysis \cite{4275059}. Also, very few digital library search engines index figures and/or associated metadata (figure captions) from PDF documents. We describe the very first step in indexing, classification, and data extraction from figures in PDF documents: accurate automatic extraction of figures and associated metadata, a nontrivial task. Document layout, font information, and lexical and linguistic features for figure caption extraction from PDF documents are considered for both rule-based and machine-learning-based approaches. We also describe a digital library search engine that indexes figure captions and mentions from 150K documents, extracted by our custom-built extractor.
Keywords :
computer graphics; document handling; feature extraction; learning (artificial intelligence); meta data; search engines; PDF documents; academic papers; automatic data extraction; automatic extraction; data extraction; digital documents; digital library search engine; document analysis; document layout; figure caption extraction; figure metadata extraction; font information; information graphics; lexical features; linguistic features; machine learning; Accuracy; Data mining; Feature extraction; Layout; Libraries; Portable document format; Search engines; information extraction; metadata based figure search;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
ISSN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2013.34
Filename :
6628599