• DocumentCode
    3268034
  • Title
    Probe, cluster, and discover: focused extraction of QA-Pagelets from the deep Web
  • Author
    Caverlee, James; Liu, Ling; Buttler, David
  • Author_Institution
    College of Computing, Georgia Institute of Technology, Atlanta, GA, USA
  • fYear
    2004
  • fDate
    30 March-2 April 2004
  • Firstpage
    103
  • Lastpage
    114
  • Abstract
    We introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep Web site are grouped into distinct clusters of structurally similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits structural and content similarity at the subtree level to identify the QA-Pagelets.
  • Keywords
    Internet; Web sites; content management; data mining; feature extraction; online front-ends; pattern clustering; query formulation; QA-Pagelet discovery; QA-Pagelet extraction; content region; deep Web; dynamic page; page cluster; query matches; subtree filtering algorithm; two-phase extraction framework; Data mining; Databases; Educational institutions; Filtering algorithms; Indexing; Navigation; Probes; Robustness; Search engines; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Proceedings of the 20th International Conference on Data Engineering (ICDE 2004)
  • ISSN
    1063-6382
  • Print_ISBN
    0-7695-2065-0
  • Type
    conf
  • DOI
    10.1109/ICDE.2004.1319988
  • Filename
    1319988
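
The two-phase idea described in the abstract (cluster structurally similar pages, then filter subtrees within a cluster) can be illustrated with a toy sketch. This is not THOR's actual algorithm: the tag-count signature, the greedy clustering, the 0.6 threshold, and the "descend while exactly one child varies" heuristic are all assumptions chosen for brevity, and every function name below is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_signature(root):
    """Count each tag name in the page: a crude structural fingerprint (assumption)."""
    return Counter(el.tag for el in root.iter())


def similarity(a, b):
    """Overlap of two tag-count signatures (multiset Jaccard)."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Phase 1 (toy): greedily group pages whose signature is close to a cluster's first page."""
    clusters = []
    for page in pages:
        sig = tag_signature(page)
        for cluster in clusters:
            if similarity(sig, tag_signature(cluster[0])) >= threshold:
                cluster.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def subtree_texts(el, path, out):
    """Map each subtree's positional path to its whitespace-normalized text."""
    out[path] = " ".join("".join(el.itertext()).split())
    seen = Counter()
    for child in el:
        seen[child.tag] += 1
        subtree_texts(child, f"{path}/{child.tag}[{seen[child.tag]}]", out)
    return out


def find_dynamic_region(cluster):
    """Phase 2 (toy): return the smallest subtree path that contains all cross-page variation."""
    tables = [subtree_texts(p, p.tag, {}) for p in cluster]
    all_paths = set().union(*tables)

    def is_dynamic(path):
        # A subtree is "dynamic" if its text differs across pages or it is missing from some page.
        return len({t.get(path) for t in tables}) > 1

    def children(path):
        depth = path.count("/") + 1
        return [p for p in all_paths
                if p.startswith(path + "/") and p.count("/") == depth]

    current = cluster[0].tag
    if not is_dynamic(current):
        return None  # nothing varies (e.g. a single-page cluster)
    while True:
        dynamic_children = [c for c in children(current) if is_dynamic(c)]
        if len(dynamic_children) != 1:
            return current
        current = dynamic_children[0]


def make_page(results):
    """Build a made-up result page: static header, result list, static footer."""
    items = "".join(f"<li>{r}</li>" for r in results)
    return ET.fromstring(
        "<html><body><div>Acme Search</div>"
        f"<div><ul>{items}</ul></div>"
        "<div>Contact us</div></body></html>"
    )


result_pages = [
    make_page(["red shoes", "red hat"]),
    make_page(["blue coat"]),
    make_page(["green scarf", "green bag", "green tie"]),
]
error_page = ET.fromstring("<html><body><p>No matches found.</p></body></html>")

clusters = cluster_pages(result_pages + [error_page])
print([len(c) for c in clusters])        # [3, 1]
print(find_dynamic_region(clusters[0]))  # html/body[1]/div[2]/ul[1]
```

In this made-up data the three result pages land in one cluster and the "no matches" page in another; within the result cluster, the header and footer are identical across pages, so the sketch narrows down to the list that holds the query matches.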