DocumentCode :
125349
Title :
A Web Service for Scholarly Big Data Information Extraction
Author :
Williams, Kresimir ; Li, Lichi ; Khabsa, Madian ; Wu, Jian ; Shih, Patrick C. ; Giles, C. Lee
Author_Institution :
Inf. Sci. & Technol., Comput. Sci. & Eng., Pennsylvania State Univ., University Park, PA, USA
fYear :
2014
fDate :
June 27, 2014 - July 2, 2014
Firstpage :
105
Lastpage :
112
Abstract :
Automatically extracting metadata and other information from scholarly documents is a common task in academic digital libraries, search engines, and document management systems, where it supports document management, categorization, and search. A Web-accessible API can simplify this task by providing a single point of extraction that multiple document workflows can share, so that no workflow has to implement and maintain its own extraction functionality. In this paper, we describe CiteSeerExtractor, a RESTful API for scholarly information extraction that exploits the duplication present in scholarly big data by means of a near-duplicate matching backend. The backend stores previously extracted metadata and skips extraction for any document that has already been processed. We describe the design, implementation, and functionality of CiteSeerExtractor and show that duplicate document matching reduces the time required to extract header and citation information from approximately 3.5 million documents by 8.46% compared to a baseline.
Keywords :
Big Data; Web services; application program interfaces; information retrieval; meta data; Big Data information extraction; CiteSeerExtractor; RESTful API; Web service; Web-accessible API; academic digital libraries; automatic metadata extraction; document categorization; document management systems; document workflows; extraction functionality; near duplicate matching backend; search engines; Big data; Data mining; Databases; Hamming distance; Information retrieval; Web servers; CiteSeerExtractor; Web service; information extraction; scholarly big data;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Web Services (ICWS), 2014 IEEE International Conference on
Conference_Location :
Anchorage, AK
Print_ISBN :
978-1-4799-5053-9
Type :
conf
DOI :
10.1109/ICWS.2014.27
Filename :
6928887