A Rule-Based Framework of Metadata Extraction from Scientific Papers

Author

Guo, Zhixin ; Jin, Hai

Author_Institution

Cluster & Grid Comput. Lab., Huazhong Univ. of Sci. & Technol., Wuhan, China

fYear

2011

fDate

14-17 Oct. 2011

Firstpage

400

Lastpage

404

Abstract

Most scientific documents on the web are unstructured or semi-structured, so automatic extraction of document metadata has become an important task. This paper describes a framework for automatic metadata extraction from scientific papers. Based on a spatial and visual knowledge principle, our system extracts the title, authors, and abstract from scientific papers. We use format information such as font size and position to guide the metadata extraction process. The experimental results show that our system achieves high accuracy in header metadata extraction, which can effectively assist automatic index creation for digital libraries.
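
As a rough illustration of the font-size-and-position idea described in the abstract, the following minimal Python sketch picks a title, an author line, and an abstract from text blocks that have already been extracted from the first page. The `Block` type, the `extract_header` function, and the three rules it applies are assumptions made for illustration only; the record does not state the rules the paper actually uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    """A text block from the first page (hypothetical input format)."""
    text: str
    font_size: float  # in points
    y: float          # distance from the top of the page; smaller is higher


def extract_header(blocks: List[Block]) -> Dict[str, str]:
    """Apply three illustrative rules (assumed, not taken from the paper):
    1. the title is the block with the largest font (topmost wins ties);
    2. the authors are the first block below the title;
    3. the abstract follows an 'Abstract' label, inline or in the next block.
    """
    result = {"title": "", "authors": "", "abstract": ""}
    ordered = sorted(blocks, key=lambda b: b.y)
    if not ordered:
        return result

    # Rule 1: largest font size; max() keeps the first (topmost) on ties.
    title = max(ordered, key=lambda b: b.font_size)
    result["title"] = title.text

    # Rule 2: first block positioned below the title.
    below = [b for b in ordered if b.y > title.y]
    if below:
        result["authors"] = below[0].text

    # Rule 3: find the 'Abstract' label and take the text that follows it.
    for i, block in enumerate(below):
        stripped = block.text.strip()
        if stripped.lower().startswith("abstract"):
            rest = stripped[len("abstract"):].lstrip(" :.-")
            if rest:
                result["abstract"] = rest
            elif i + 1 < len(below):
                result["abstract"] = below[i + 1].text
            break
    return result


if __name__ == "__main__":
    page = [
        Block("Guo, Zhixin; Jin, Hai", 11.0, 120.0),
        Block("A Rule-Based Framework of Metadata Extraction", 18.0, 60.0),
        Block("Abstract", 10.0, 200.0),
        Block("Most scientific documents on the web are unstructured.", 9.0, 220.0),
    ]
    print(extract_header(page))
```

A real system would need further rules (multi-line titles, affiliations, two-column layouts); the sketch only shows how layout features can drive the selection without any learned model.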

Keywords

Internet; digital libraries; document handling; indexing; information retrieval; knowledge based systems; meta data; natural sciences computing; Web; automatic document metadata extraction; automatic index creation; digital libraries; header metadata extraction; rule-based framework; scientific documents; scientific papers; spatial knowledge principle; visual knowledge principle; Accuracy; Data mining; Layout; Libraries; Portable document format; Semantics; XML; document metadata; information extraction; rule-based approach;

fLanguage

English

Publisher

IEEE

Conference_Titel

Distributed Computing and Applications to Business, Engineering and Science (DCABES), 2011 Tenth International Symposium on

Conference_Location

Wuxi

Print_ISBN

978-1-4577-0327-0

Type

conf

DOI

10.1109/DCABES.2011.14

Filename

6118700