DocumentCode :
3756638
Title :
Optimization of All Pairs Similarity Search
Author :
Yuechen Chen;Xinhuai Tang;Bing Liu;Delai Chen
Author_Institution :
Sch. of Software Eng., Shanghai Jiao Tong Univ., Shanghai, China
fYear :
2015
Firstpage :
637
Lastpage :
642
Abstract :
All pairs similarity search (APSS) is the problem of finding all pairs of items whose similarity is above a given threshold. APSS algorithms are applied in many data mining fields, such as document matching and collaborative filtering. Because real-world datasets are large, recent work has used partitioning, inverted indexing, parallel accumulation, and hashing approximation to optimize APSS algorithms. This paper analyzes and compares two parallel approaches to optimizing APSS. To demonstrate the performance gain of these optimizations, we implement our algorithms on Spark and evaluate them on a dataset of one million movies, achieving a better speedup than other works.
Keywords :
"Optimization","Approximation algorithms","Indexes","Partitioning algorithms","Data mining","Search problems","Sparks"
Publisher :
IEEE
Conference_Title :
Computational Science and Computational Intelligence (CSCI), 2015 International Conference on
Type :
conf
DOI :
10.1109/CSCI.2015.16
Filename :
7424169