Cluster analysis for the cloud: Parallel competitive fitness and parallel K-means++ for large dataset analysis

Author

Esteves, R.M. ; Hacker, T. ; Chunming Rong

Author_Institution

Dept. of Electr. & Comput. Eng., Univ. of Stavanger, Stavanger, Norway

fYear

2012

fDate

3-6 Dec. 2012

Firstpage

177

Lastpage

184

Abstract

The amount of resources needed to provision Virtual Machines (VM) in a cloud computing systems to support virtual HPC clusters can be predicted from the analysis of historic use data. In previous work, Hacker et al. found that cluster analysis is a useful tool to understand the underlying spatio-temporal dependencies present in system fault and use logs. However, the cluster analysis used for reducing spatio-temporal dependences should be fast and accurate to understand the underlying stochastic properties of these systems. K-means is a fast cluster analysis method, in which accuracy depends on the use of initialization algorithms that are usually serial and slow. In this paper we present two new parallel strategies for fast seeding K-means cluster analysis. Both strategies were tested on a real problem where the aim was to reduce spatial and temporal dependencies of failures on large supercomputer systems. The performance of both strategies were compared with five existing serial implementations: K-means implementations of 1) Lloyd (L); 2) McQueen (M); and 3) Hartigan - Wong (HW), all of them using Forgy seeding; 4) K-means++; and 5) Neural Gas clustering (NG), a more recent and sophisticated method. Our results show that our new Parallel Competitive Fitness approach reduces the Within Sum of Squares (WSQQ) measure, thus increasing cluster quality of the three K-means implementations: L; M; HW, and is 200 times faster than the existing serial K-means++. The existing serial and our new Parallel K-means++ have the lowest WSQQ. Our new Parallel K-means++ is twice as fast as the existing serial K-means++ method, and is 4 times faster than the NG method. Moreover, our new methods did not generate empty clusters, while NG did. As a result of our new techniques, predicting the amount of resources needed to provision VMs processing historic system fault and use data can now be done faster and with more accuracy.

Keywords

cloud computing; competitive algorithms; mainframes; parallel processing; pattern clustering; virtual machines; HW K-means implementation; Hartigan-Wong K-means implementations; L K-means implementation; Lloyd K-means implementations; M K-means implementation; McQueen K-means implementations; NG method; VMs processing historic system fault; WSQQ measurement; cloud computing systems; fast seeding K-means cluster analysis; forgy seeding; historic use data; initialization algorithms; large dataset analysis; large supercomputer systems; neural gas clustering; parallel K-means++; parallel competitive fitness approach; parallel strategies; spatial dependencies reduction; spatio-temporal dependencies; stochastic properties; system fault; virtual HPC cluster analysis; virtual machines; within sum of squares measurement; Accuracy; Algorithm design and analysis; Cloud computing; Clustering algorithms; Computer hacking; Conferences; Partitioning algorithms; Clustering; K-means; Neural Gas; VM resources prediction; parallel K-means++;

fLanguage

English

Publisher

ieee

Conference_Titel

Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference on

Conference_Location

Taipei

Print_ISBN

978-1-4673-4511-8

Electronic_ISBN

978-1-4673-4509-5

Type

conf

DOI

10.1109/CloudCom.2012.6427553

Filename

6427553

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=592843