Title :
Optimization of All Pairs Similarity Search
Author :
Yuechen Chen;Xinhuai Tang;Bing Liu;Delai Chen
Author_Institution :
Sch. of Software Eng., Shanghai Jiao Tong Univ., Shanghai, China
Abstract :
All pairs similarity search (APSS) is the problem of finding all pairs of items whose similarity is above a given threshold. APSS is applied in many data mining fields, such as document matching and collaborative filtering. Because real-world datasets are large, recent work has used partitioning, inverted indexing, parallel accumulation, and hashing approximation to optimize APSS algorithms. This paper analyzes and compares two parallel approaches to optimizing APSS. To demonstrate the performance gain of these optimizations, we implement our algorithms on Spark and evaluate them on a dataset of one million movies, achieving a better speedup than other works.
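To make the problem definition in the abstract concrete, the following is a minimal brute-force sketch of APSS, not the paper's optimized method. It assumes cosine similarity over sparse vectors stored as dictionaries; the abstract does not name a similarity measure, and the function names `cosine` and `all_pairs` are hypothetical. The quadratic pairwise loop is the cost that partitioning, inverted indexing, and parallel execution aim to reduce.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts.

    Cosine is an assumption made for illustration; the abstract does not
    specify which similarity measure is used.
    """
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def all_pairs(items, threshold):
    """Return every (i, j, similarity) with i < j and similarity > threshold.

    This compares all n * (n - 1) / 2 pairs, so it is only a baseline
    that illustrates the problem statement.
    """
    result = []
    for (i, u), (j, v) in combinations(enumerate(items), 2):
        sim = cosine(u, v)
        if sim > threshold:
            result.append((i, j, sim))
    return result
```

For example, calling `all_pairs(items, 0.5)` on a list of three sparse vectors in which only the first two share features would report just the pair `(0, 1)`.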
Keywords :
"Optimization","Approximation algorithms","Indexes","Partitioning algorithms","Data mining","Search problems","Sparks"
Conference_Title :
Computational Science and Computational Intelligence (CSCI), 2015 International Conference on
DOI :
10.1109/CSCI.2015.16