A Novel Clustering Algorithm on Large-Scale Graph Data

Author

Hao Zhang ; Wei Zhou ; Xiaoyu Wan ; Ge Fu ; Zhiyong Xu ; Jizhong Han

Author_Institution

Sch. of Commun. & Inf. Eng., Chongqing Univ. of Posts & Telecommun., Chongqing, China

fYear

2014

Firstpage

Lastpage

Abstract

Large scale Social Network Service (SNS) applications use graph as the default data organization model. To discover hidden useful information, data clustering techniques are widely adopted in data analysis tasks. State-of-the-art clustering algorithms are based on distributed programming paradigms such as MapReduce and Bulk Synchronization Parallel (BSP) models. These solutions suffer from heavy data transmission cost as well as excessive storage overhead problems. Thus, they can hardly achieve the optimal performance. In this paper, we propose a novel clustering algorithm based on pagerank to relieve these issues. Compare to current systems, our algorithm can achieve similar clustering results on large-scale graph data with much lower network and storage overhead. It consists of three steps. First, it calculates the pagerank values for each vertex based on the underlying data set. Then, we select multiple vertexes from the list as clustering centers. Third, our algorithm expands the range of the cluster by adding more neighboring vertexes iteratively. We compare our algorithm with other popular clustering algorithms using real world data sets and the results show that it achieves better performance than any other distributed solutions. For example, on a web graph data set with hundreds of thousands of vertexes and millions of edges, the time and memory consumption can be reduced by more than 95% and 70%, respectively.

Keywords

Internet; data analysis; distributed programming; graph theory; parallel processing; pattern clustering; social networking (online); storage management; BSP model; MapReduce; SNS application; Web graph data set; bulk synchronization parallel model; clustering center; data analysis task; data clustering technique; default data organization model; distributed programming paradigm; distributed solution; heavy data transmission cost; large-scale graph data; memory consumption; optimal performance; pagerank; popular clustering algorithm; real world data set; social network service application; storage overhead problem; Algorithm design and analysis; Attenuation; Clustering algorithms; Computational modeling; Data analysis; Distributed databases;

fLanguage

English

Publisher

ieee

Conference_Titel

Cloud Computing and Big Data (CCBD), 2014 International Conference on

Type

conf

DOI

10.1109/CCBD.2014.23

Filename

7062871

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3579180