• DocumentCode
    751090
  • Title

    Effectively mining and using coverage and overlap statistics for data integration

  • Author

    Nie, Zaiqing ; Kambhampati, Subbarao ; Nambiar, Ullas

  • Author_Institution
    Microsoft Research Asia, Beijing, China
  • Volume
    17
  • Issue
    5
  • fYear
    2005
  • fDate
    5/1/2005
  • Firstpage
    638
  • Lastpage
    651
  • Abstract
    Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping the number of needed statistics low enough to keep the storage and learning costs manageable. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics while keeping the number of needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics, both over controlled data sources and in the context of BibFinder with autonomous online sources.
  • Keywords
    Internet; bibliographic systems; data mining; query processing; statistical analysis; BibFinder; coverage statistics; data integration; data source; overlap statistics; statistical information; Association rules; Bibliographies; Computer science; Costs; Software libraries; Statistics; Telecommunication traffic; Index Terms: Query optimization for data integration; association rule mining; coverage and overlap statistics
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2005.76
  • Filename
    1411743