Automatic extraction of top-k lists from the web

Author

Zhixian Zhang ; Zhu, K.Q. ; Haixun Wang ; Hongsong Li

Author_Institution

Shanghai Jiao Tong Univ., Shanghai, China

fYear

2013

fDate

8-12 April 2013

Firstpage

1057

Lastpage

1068

Abstract

This paper is concerned with information extraction from top-k web pages, which are web pages that describe the top k instances of a topic of general interest. Examples include “the 10 tallest buildings in the world”, “the 50 hits of 2010 you don't want to miss”, etc. Compared to other structured information on the web (including web tables), information in top-k lists is larger and richer, of higher quality, and generally more interesting. Therefore top-k lists are highly valuable. For example, they can help enrich open-domain knowledge bases (to support applications such as search or fact answering). In this paper, we present an efficient method that extracts top-k lists from web pages with high performance. Specifically, we extract more than 1.7 million top-k lists from a web corpus of 1.6 billion pages with 92.0% precision and 72.3% recall.

Keywords

Web sites; information analysis; knowledge based systems; Web corpus; automatic top-k lists extraction; information extraction; open-domain knowledge bases; top-k Web pages; Companies; Context; Data mining; Digital audio broadcasting; Feature extraction; Knowledge based systems; Web pages; Web information extraction; list extraction; top-k lists; web mining

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Engineering (ICDE), 2013 IEEE 29th International Conference on

Conference_Location

Brisbane, QLD

ISSN

1063-6382

Print_ISBN

978-1-4673-4909-3

Electronic_ISBN

1063-6382

Type

conf

DOI

10.1109/ICDE.2013.6544897

Filename

6544897