Title :
Scalable SimRank join algorithm
Author :
Maehara, Takanori ; Kusumoto, Mitsuru ; Kawarabayashi, Ken-ichi
Author_Institution :
ERATO, Kawarabayashi Large Graph Project, JST, Japan
Abstract :
Similarity join finds all pairs of objects (i, j) with similarity score s(i, j) greater than some specified threshold θ. This is a fundamental query problem in the database research community, and is used in many practical applications, such as duplicate detection, merge/purge, record linkage, object matching, and reference conciliation. In this paper, we propose a scalable approximation algorithm with an arbitrary accuracy for the similarity join problem with the SimRank similarity measure. The algorithm consists of two phases: filter and verification. The filter phase enumerates similar pair candidates, and the similarity of each candidate is then assessed in the verification phase. The scalability of the proposed algorithm is experimentally verified for large real networks. The complexity depends only on the number of similar pairs, but does not depend on all pairs O(n2). The proposed algorithm scales up to the network of 5M vertices and 70M edges. By comparing the state-of-the-art algorithms, it is about 10 times faster and it requires about 10 times smaller memory.
Keywords :
computational complexity; query processing; computational complexity; duplicate detection; fundamental query problem; merge-purge; object matching; record linkage; reference conciliation; scalable SimRank join algorithm; scalable approximation algorithm; Accuracy; Algorithm design and analysis; Approximation algorithms; Complexity theory; Couplings; Memory management; Monte Carlo methods;
Conference_Titel :
Data Engineering (ICDE), 2015 IEEE 31st International Conference on
Conference_Location :
Seoul
DOI :
10.1109/ICDE.2015.7113318