Title :
Efficient large scale distributed matrix computation with spark
Author :
Rong Gu;Yun Tang;Zhaokang Wang;Shuai Wang;Xusen Yin;Chunfeng Yuan;Yihua Huang
Author_Institution :
National Key Laboratory for Novel Software Technology, Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University, Nanjing, China 210093
Abstract :
Matrix computation is the core of many massive data-intensive analytical applications such mining social networks, recommendation systems and nature language processing. Due to the importance of matrix computation, it has been widely studied for many years. In the Big Data ear, as the scale of the matrix grows, traditional single-node matrix computation systems can hardly cope with such large data and computation. Existing distributed matrix computation solutions are still not efficient enough, or have poor fault tolerance and usability. In this paper, we propose Marlin, an efficient distributed matrix computation library which is built on top of Spark. Marlin contains several distributed matrix operation algorithms and provides high-level matrix computation primitives for users. In Marlin, we proposed three distributed matrix multiplication algorithms for different situations. Based on this, we designed an adaptive model to choose the best approach for different problems. Moreover, to improve the computation performance, instead of naively using Spark, we put forward some optimizations including taking advantage of the native linear algebra library, reducing shuffle communication and increasing parallelism. Experimental results show that Marlin is over an order of magnitude faster than R (a widely-used statistical computing system) and the existing distributed matrix operation algorithms based on MapReduce. Moreover, Marlin achieves comparable performance to the specialized MPI-based matrix multiplication algorithm SUMMA but uses a general dataflow engine and gains common dataflow features such as scalability and fault tolerance.
Keywords :
"Sparks","Libraries","Fault tolerance","Fault tolerant systems","Big data","Algorithm design and analysis","Linear algebra"
Conference_Titel :
Big Data (Big Data), 2015 IEEE International Conference on
DOI :
10.1109/BigData.2015.7364023