• DocumentCode
    60
  • Title
    Annotating Search Results from Web Databases
  • Author
    Yiyao Lu ; Hai He ; Hongkun Zhao ; Weiyi Meng ; Clement Yu
  • Author_Institution
    Dept. of Comput. Sci., Binghamton Univ., Binghamton, NY, USA
  • Volume
    25
  • Issue
    3
  • fYear
    2013
  • fDate
    March 2013
  • Firstpage
    514
  • Lastpage
    527
  • Abstract
    An increasing number of databases have become web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep web data collection and Internet comparison shopping, they need to be extracted and assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantics. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
  • Keywords
    Internet; Web sites; hypermedia markup languages; information retrieval systems; HTML form-based search interfaces; Internet comparison shopping; SRR; WDB; Web data collection; Web databases; annotation wrapper; automatic annotation approach; database encoding; encoded data units; human browsing; machine processable data units; search result records; search site; Clustering algorithms; Data mining; Database systems; HTML; Information retrieval; Ontologies; Semantics; Data alignment; data annotation; web database; wrapper generation;
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2011.175
  • Filename
    5989804