Title :
L2AP: Fast cosine similarity search with prefix L-2 norm bounds
Author :
Anastasiu, David C. ; Karypis, George
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Minnesota, Minneapolis, MN, USA
fDate :
March 31 2014-April 4 2014
Abstract :
The All-Pairs similarity search, or self-similarity join problem, finds all pairs of vectors in a high dimensional sparse dataset with a similarity value higher than a given threshold. The problem has been classically solved using a dynamically built inverted index. The search time is reduced by early pruning of candidates using size and value-based bounds on the similarity. In the context of cosine similarity and weighted vectors, leveraging the Cauchy-Schwarz inequality, we propose new ℓ2-norm bounds for reducing the inverted index size, candidate pool size, and the number of full dot-product computations. We tighten previous candidate generation and verification bounds and introduce several new ones to further improve our algorithm´s performance. Our new pruning strategies enable significant speedups over baseline approaches, most times outperforming even approximate solutions. We perform an extensive evaluation of our algorithm, L2AP, and compare against state-of-the-art exact and approximate methods, AllPairs, MMJoin, and BayesLSH, across a variety of real-world datasets and similarity thresholds.
Keywords :
information filtering; AllPairs method; BayesLSH method; Cauchy-Schwarz inequality; L2AP algorithm; MMJoin method; all-pairs similarity search; cosine similarity context; dynamically built inverted index; fast cosine similarity search; filtering strategies; high dimensional sparse dataset; prefix l2-norm bounds; pruning strategies; search time; self-similarity join problem; similarity thresholds; similarity value; size-based bounds; value-based bounds; weighted vectors context; Approximation algorithms; Heuristic algorithms; Indexing; Search problems; Upper bound; Vectors;
Conference_Titel :
Data Engineering (ICDE), 2014 IEEE 30th International Conference on
Conference_Location :
Chicago, IL
DOI :
10.1109/ICDE.2014.6816700