Regional Information Center for Science and Technology

DocumentCode :

531677

Title :

Parsing Publication Lists on the Web

Author :

Yang, Kai-Hsiang ; Ho, Jan-Ming

Author_Institution :

Dept. of Math. & Inf. Educ., Nat. Taipei Univ. of Educ., Taipei, Taiwan

Volume :

fYear :

2010

fDate :

Aug. 31 2010-Sept. 3 2010

Firstpage :

444

Lastpage :

447

Abstract :

Researchers usually present their publication records (called citation records in this paper) in publication lists on the Web. These lists could be an important data source for applications that need to collect more publication records than digital libraries such as DBLP provide. However, designing an algorithm to extract citation records from publication lists remains difficult because of the diversity of page layouts and citation formats. In this paper, we propose an automatic approach to extracting citation records from publication list pages that exploits two properties. First, citation records are usually represented as nodes at the same level of the DOM tree. Second, citation records on the same page are presented with similar HTML tags. Extensive experiments are conducted to measure the effects of all parameters and the system performance. Experimental results show that our approach performs stably and well (with an average F-measure of 86.2%).
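
The two properties named in the abstract (records sit at the same DOM level and share similar HTML tags) can be illustrated with a small sketch. This is not the authors' algorithm, whose details and parameters are not given in this record. It is a minimal heuristic under stated assumptions: "similar tags" is approximated by an identical signature (the element's tag plus the set of its descendant tags), and the largest group of same-signature siblings is taken as the record list. The names `Node`, `TreeBuilder`, `extract_records`, and `SAMPLE` are hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM-like node; items holds child Nodes and text strings in order."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []

    def children(self):
        return [item for item in self.items if isinstance(item, Node)]

    def descendant_tags(self):
        tags = set()
        for child in self.children():
            tags.add(child.tag)
            tags |= child.descendant_tags()
        return tags

    def signature(self):
        # Assumption: two nodes use "similar HTML tags" when this tuple is equal.
        return (self.tag, tuple(sorted(self.descendant_tags())))

    def full_text(self):
        parts = [i if isinstance(i, str) else i.full_text() for i in self.items]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.items.append(data.strip())


def extract_records(html, min_group=2):
    """Return the text of the largest group of same-level, same-signature nodes."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        kids = node.children()
        stack.extend(kids)
        groups = {}
        for kid in kids:  # siblings only, i.e. nodes at the same DOM level
            groups.setdefault(kid.signature(), []).append(kid)
        for group in groups.values():
            if len(group) >= min_group and len(group) > len(best):
                best = group
    return [n.full_text() for n in best]


# Made-up page for illustration only; not data from the paper.
SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/cv">CV</a></div>
<ul>
  <li><b>A. Author</b>, <i>First Title</i>, Venue A, 2009.</li>
  <li><b>B. Author</b>, <i>Second Title</i>, Venue B, 2010.</li>
  <li><b>C. Author</b>, <i>Third Title</i>, Venue C, 2010.</li>
</ul>
</body></html>
"""

if __name__ == "__main__":
    for record in extract_records(SAMPLE):
        print(record)
```

On the sample page the three `li` elements form the largest same-signature sibling group, so they are returned and the two navigation links are not. Real publication pages would likely need a looser similarity measure than exact signature equality, since individual records often differ slightly in markup.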

Keywords :

Internet; program compilers; publishing; Web; citation records; digital libraries; publication list parsing; Web mining; citation extraction; data extraction;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on

Conference_Location :

Toronto, ON

Print_ISBN :

978-1-4244-8482-9

Electronic_ISBN :

978-0-7695-4191-4

Type :

conf

DOI :

10.1109/WI-IAT.2010.206

Filename :

5616659

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=531677