DocumentCode
2174685
Title
Scalability Issues for Self Similarity Join in Distributed Systems
Author
Gennaro, Claudio ; Rabitti, Fausto
Author_Institution
ISTI-CNR, Pisa, Italy
fYear
2010
fDate
17-19 Feb. 2010
Firstpage
309
Lastpage
316
Abstract
Efficient processing of similarity joins is important for a large class of data analysis and data-mining applications. This primitive finds all pairs of records within a predefined distance threshold of each other. However, most of the existing approaches have been based on spatial join techniques designed primarily for data in a vector space. Treating data collections as metric objects brings a great advantage in generality, because a single metric technique can be applied to many specific search problems quite different in nature. In this paper, we concentrate our attention on a special form of join, the Self Similarity Join, which retrieves pairs from the same dataset. In particular, we consider the case in which the dataset is split into subsets that are searched for self similarity join independently (e.g., as in a distributed computing environment). To this end, we formalize the abstract concept of ε-Cover, prove its correctness, and demonstrate its effectiveness by applying it to two real implementations on a real-life large dataset.
Keywords
data mining; set theory; software metrics; data analysis; data-mining applications; distributed computing environment; distributed systems; real-life large dataset; spatial join techniques; subsets; Cleaning; Clustering algorithms; Data analysis; Data mining; Databases; Distributed computing; Information retrieval; Scalability; Search problems; Time series analysis;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel, Distributed and Network-Based Processing (PDP), 2010 18th Euromicro International Conference on
Conference_Location
Pisa
ISSN
1066-6192
Print_ISBN
978-1-4244-5672-7
Electronic_ISBN
1066-6192
Type
conf
DOI
10.1109/PDP.2010.73
Filename
5452451