Title :
Automatic extraction of top-k lists from the web
Author :
Zhixian Zhang ; K.Q. Zhu ; Haixun Wang ; Hongsong Li
Author_Institution :
Shanghai Jiao Tong Univ., Shanghai, China
Abstract :
This paper is concerned with information extraction from top-k web pages, that is, web pages that describe the top k instances of a topic of general interest. Examples include “the 10 tallest buildings in the world” and “the 50 hits of 2010 you don’t want to miss”. Compared to other structured information on the web (including web tables), information in top-k lists is larger and richer, of higher quality, and generally more interesting. Top-k lists are therefore highly valuable; for example, they can help enrich open-domain knowledge bases (to support applications such as search or fact answering). In this paper, we present an efficient method that extracts top-k lists from web pages with high performance. Specifically, we extract more than 1.7 million top-k lists from a web corpus of 1.6 billion pages with 92.0% precision and 72.3% recall.
Keywords :
Web sites; information analysis; knowledge based systems; Web corpus; automatic top-k lists extraction; information extraction; open-domain knowledge bases; top-k Web pages; Companies; Context; Data mining; Digital audio broadcasting; Feature extraction; Knowledge based systems; Web pages; Web information extraction; list extraction; top-k lists; web mining;
Conference_Title :
Data Engineering (ICDE), 2013 IEEE 29th International Conference on
Conference_Location :
Brisbane, QLD
Print_ISBN :
978-1-4673-4909-3
ISSN :
1063-6382
DOI :
10.1109/ICDE.2013.6544897