DocumentCode :
610308
Title :
Finding connected components in map-reduce in logarithmic rounds
Author :
Rastogi, V. ; Machanavajjhala, A. ; Chitnis, L. ; Das Sarma, Akash
Author_Institution :
Google, Mountain View, CA, USA
fYear :
2013
fDate :
8-12 April 2013
Firstpage :
50
Lastpage :
61
Abstract :
Given a large graph G = (V, E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting by d the diameter of the graph, and by n the number of nodes in the largest component, all prior techniques for map-reduce require either a linear, Θ(d), number of rounds, or a quadratic, Θ(n|V| + |E|), communication per round. We propose here two efficient map-reduce algorithms: (i) Hash-Greater-to-Min, which is a randomized algorithm based on PRAM techniques, requiring O(log n) rounds and O(|V| + |E|) communication per round, and (ii) Hash-to-Min, which is a novel algorithm, provably finishing in O(log n) iterations for path graphs. The proof technique used for Hash-to-Min is novel, but not tight, and the algorithm is actually faster than Hash-Greater-to-Min in practice. We conjecture that it requires 2 log d rounds and 3(|V| + |E|) communication per round, as demonstrated in our experiments. Using secondary sorting, a standard map-reduce feature, we scale Hash-to-Min to graphs with very large connected components. Our techniques for connected components can be applied to clustering as well. We propose a novel algorithm for agglomerative single linkage clustering in map-reduce. This is the first map-reduce algorithm for clustering in at most O(log n) rounds, where n is the size of the largest cluster. We show the effectiveness of all our algorithms through detailed experiments on large synthetic as well as real-world datasets.
Keywords :
computational complexity; file organisation; graph theory; pattern clustering; randomised algorithms; sorting; PRAM technique; agglomerative single linkage clustering; connected component; graph diameter; hash-greater-to-min; logarithmic rounds; map-reduce algorithm; map-reduce rounds; proof technique; randomized algorithm; real-world dataset; secondary sorting; Clustering algorithms; Complexity theory; Convergence; Couplings; Merging; Phase change random access memory; Vegetation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering (ICDE), 2013 IEEE 29th International Conference on
Conference_Location :
Brisbane, QLD
ISSN :
1063-6382
Print_ISBN :
978-1-4673-4909-3
Electronic_ISBN :
1063-6382
Type :
conf
DOI :
10.1109/ICDE.2013.6544813
Filename :
6544813