Title :
Mining Publication Records on Personal Publication Web Pages Based on Conditional Random Fields
Author :
Jen-Ming Chung ; Ya-Huei Lin ; Hahn-Ming Lee ; Jan-Ming Ho
Author_Institution :
Inst. of Inf. Sci., Acad. Sinica, Taipei, Taiwan
Abstract :
A publication record is a list of semi-structured citation strings for the publications of a research institute or an individual researcher. Publication records are integrated into a digital library to form an important knowledge base, which in turn enables a variety of applications. A publication record is usually found among other information on a publication Web page (or publication page for short). Extracting publication records from these Web pages is thus an interesting problem. The problem is difficult for several reasons, including the flexibility in formatting the metadata of a publication into a semi-structured citation string and in rendering that citation string as a visual presentation in HTML. Furthermore, two citation strings with similar visual presentation on the same Web page may have different HTML constructs. In this paper, we present a content analysis approach based on Conditional Random Fields and data region boundary analysis to automatically extract citation records from a publication page. Experimental results show that our method performs well on a benchmark containing manually crafted publication Web pages. The precision, recall, and F-measure are 82.5%, 87.6%, and 85.0%, respectively. This is an improvement over previous results.
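The three reported figures can be cross-checked against one another. The sketch below assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name f_measure and its beta parameter are illustrative only.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta = 1.0 gives the balanced F1 score.

    Assumption: the abstract's "F-measure" is F1. This is not confirmed
    by the source record.
    """
    if precision <= 0.0 and recall <= 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Figures quoted in the abstract, expressed as fractions.
reported_precision = 0.825
reported_recall = 0.876

f1 = f_measure(reported_precision, reported_recall)
print(f"F1 = {f1 * 100:.1f}%")  # prints "F1 = 85.0%"
```

Under that assumption, the computed value rounds to the 85.0% quoted in the abstract, so the three numbers are mutually consistent.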
Keywords :
Internet; citation analysis; data mining; digital libraries; hypermedia markup languages; metadata; statistical analysis; Conditional Random Fields; HTML; content analysis approach; data region boundary analysis; digital library; personal publication Web pages; publication records; semi-structured citation string; visual presentation; publication record extraction
Conference_Title :
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on
Conference_Location :
Macau
Print_ISBN :
978-1-4673-6057-9
DOI :
10.1109/WI-IAT.2012.67