• DocumentCode
    610392
  • Title

    Automatic extraction of top-k lists from the web

  • Author

    Zhixian Zhang ; Zhu, K.Q. ; Haixun Wang ; Hongsong Li

  • Author_Institution
    Shanghai Jiao Tong Univ., Shanghai, China
  • fYear
    2013
  • fDate
    8-12 April 2013
  • Firstpage
    1057
  • Lastpage
    1068
  • Abstract
    This paper is concerned with information extraction from top-k web pages, which are web pages that describe top k instances of a topic which is of general interest. Examples include “the 10 tallest buildings in the world”, “the 50 hits of 2010 you don´t want to miss”, etc. Compared to other structured information on the web (including web tables), information in top-k lists is larger and richer, of higher quality, and generally more interesting. Therefore top-k lists are highly valuable. For example, it can help enrich open-domain knowledge bases (to support applications such as search or fact answering). In this paper, we present an efficient method that extracts top-k lists from web pages with high performance. Specifically, we extract more than 1.7 million top-k lists from a web corpus of 1.6 billion pages with 92.0% precision and 72.3% recall.
  • Keywords
    Web sites; information analysis; knowledge based systems; Web corpus; automatic top-k lists extraction; information extraction; open-domain knowledge bases; top-k Web pages; Companies; Context; Data mining; Digital audio broadcasting; Feature extraction; Knowledge based systems; Web pages; Web information extraction; list extraction; top-k lists; web mining;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Engineering (ICDE), 2013 IEEE 29th International Conference on
  • Conference_Location
    Brisbane, QLD
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4673-4909-3
  • Electronic_ISBN
    1063-6382
  • Type

    conf

  • DOI
    10.1109/ICDE.2013.6544897
  • Filename
    6544897