DocumentCode :
2984761
Title :
High Performance Offline and Online Distributed Collaborative Filtering
Author :
Narang, Arun ; Srivastava, Anurag ; Katta, N.P.K.
Author_Institution :
IBM India Res. Lab., New Delhi, India
fYear :
2012
fDate :
10-13 Dec. 2012
Firstpage :
549
Lastpage :
558
Abstract :
Big data analytics is a hot research area in both academia and industry. It envisages processing massive amounts of data at high rates to generate new insights, leading to a positive impact (for both users and providers) on industries such as e-commerce, telecom, finance, life sciences and so forth. We consider collaborative filtering (CF) and clustering algorithms, which are fundamental analytics kernels that help in achieving these aims. Achieving high-throughput CF and co-clustering on highly sparse and massive datasets, along with high prediction accuracy, is a computationally challenging problem. In this paper, we present a novel hierarchical design for a soft real-time (less than 1 minute) distributed co-clustering based collaborative filtering algorithm. We study both the online and offline variants of this algorithm. Theoretical analysis of the time complexity of our algorithm proves the efficacy of our approach. Further, we present the impact of load balancing based optimizations on multi-core cluster architectures. Using the Netflix dataset (900M training ratings with replication) as well as the Yahoo KDD Cup dataset (2.3B training ratings with replication), we demonstrate the performance and scalability of our algorithm on a large multi-core cluster architecture. In offline mode, our distributed algorithm demonstrates around 4x better performance (on Blue Gene/P) compared to the best prior work, along with high accuracy. In online mode, we demonstrate around 3x better performance compared to a baseline MPI implementation. To the best of our knowledge, our algorithm provides the best known online and offline performance and scalability results with high accuracy on multi-core cluster architectures.
Keywords :
collaborative filtering; computational complexity; distributed algorithms; multiprocessing systems; resource allocation; Netflix dataset; analytics kernels; clustering algorithm; data analytics; distributed algorithm; high performance offline distributed collaborative filtering; load balancing based optimization; multicore cluster architecture; online distributed collaborative filtering algorithm; time complexity; Algorithm design and analysis; Approximation methods; Clustering algorithms; Collaboration; Matrix decomposition; Partitioning algorithms; Training; Distributed Collaborative Filtering; Parallel Performance Optimizations; Performance & Scalability Analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining (ICDM), 2012 IEEE 12th International Conference on
Conference_Location :
Brussels
ISSN :
1550-4786
Print_ISBN :
978-1-4673-4649-8
Type :
conf
DOI :
10.1109/ICDM.2012.128
Filename :
6413871