• DocumentCode
    3122882
  • Title
    Top-k Set Similarity Joins
  • Author
    Xiao, Chuan ; Wang, Wei ; Lin, Xuemin ; Shang, Haichuan
  • Author_Institution
    NICTA, Univ. of New South Wales, Kensington, NSW
  • fYear
    2009
  • fDate
    March 29 2009-April 2 2009
  • Firstpage
    916
  • Lastpage
    927
  • Abstract
    Similarity join is a useful primitive operation underlying many applications, such as near-duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users have to perform when the similarity threshold is unknown beforehand. An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently. It is based on the prefix filtering principle and employs tight upper bounding of similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
  • Keywords
    data handling; query processing; Web page detection; data integration; large-scale real datasets; pattern recognition; prefix filtering principle; top-k pairs; top-k set similarity joins; Couplings; Data engineering; Data mining; Euclidean distance; Filtering; Large-scale systems; Pattern recognition; Time factors; Upper bound; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
  • Conference_Location
    Shanghai
  • ISSN
    1084-4627
  • Print_ISBN
    978-1-4244-3422-0
  • Electronic_ISBN
    1084-4627
  • Type
    conf
  • DOI
    10.1109/ICDE.2009.111
  • Filename
    4812465
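
The operation named in the abstract, returning the k most similar pairs of records without a user-supplied threshold, can be illustrated with a minimal brute-force sketch. This is not the paper's topk-join algorithm: it uses no prefix filtering and no upper bounds on unseen pairs, and simply scores every pair. The choice of Jaccard similarity, the function names `jaccard` and `naive_topk_join`, and the sample records are all assumptions made for illustration.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def naive_topk_join(records, k):
    """Return the k most similar record pairs as (similarity, i, j), best first.

    Hypothetical baseline: compares all pairs, so it costs O(n^2) similarity
    computations. It only shows what a top-k set similarity join returns.
    """
    scored = (
        (jaccard(records[i], records[j]), i, j)
        for i, j in combinations(range(len(records)), 2)
    )
    # nlargest keeps only k candidates in memory and returns them sorted.
    return heapq.nlargest(k, scored)


# Made-up sample records, each a set of tokens.
records = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"x", "y", "z"},
    {"x", "y", "w"},
    {"a", "x"},
]

top2 = naive_topk_join(records, 2)
for sim, i, j in top2:
    print(f"records {i} and {j}: similarity {sim:.2f}")
```

With these sample records the two best pairs are records 0 and 1 (three shared tokens out of five) and records 2 and 3 (two shared tokens out of four). A threshold-based join would need the user to guess a cutoff such as 0.5 in advance to obtain the same result.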