DocumentCode :
3122882
Title :
Top-k Set Similarity Joins
Author :
Xiao, Chuan ; Wang, Wei ; Lin, Xuemin ; Shang, Haichuan
Author_Institution :
NICTA, Univ. of New South Wales, Kensington, NSW
fYear :
2009
fDate :
March 29 2009-April 2 2009
Firstpage :
916
Lastpage :
927
Abstract :
Similarity join is a useful primitive operation underlying many applications, such as near-duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users have to perform when the similarity threshold is unknown beforehand. An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently. It is based on the prefix filtering principle and employs tight upper bounding of similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
Keywords :
data handling; query processing; Web page detection; data integration; large-scale real datasets; pattern recognition; prefix filtering principle; top-k pairs; top-k set similarity joins; Couplings; Data engineering; Data mining; Euclidean distance; Filtering; Large-scale systems; Pattern recognition; Time factors; Upper bound; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
Conference_Location :
Shanghai
ISSN :
1084-4627
Print_ISBN :
978-1-4244-3422-0
Electronic_ISBN :
1084-4627
Type :
conf
DOI :
10.1109/ICDE.2009.111
Filename :
4812465