DocumentCode
592843
Title
Cluster analysis for the cloud: Parallel competitive fitness and parallel K-means++ for large dataset analysis
Author
Esteves, R.M. ; Hacker, T. ; Chunming Rong
Author_Institution
Dept. of Electr. & Comput. Eng., Univ. of Stavanger, Stavanger, Norway
fYear
2012
fDate
3-6 Dec. 2012
Firstpage
177
Lastpage
184
Abstract
The amount of resources needed to provision Virtual Machines (VM) in a cloud computing systems to support virtual HPC clusters can be predicted from the analysis of historic use data. In previous work, Hacker et al. found that cluster analysis is a useful tool to understand the underlying spatio-temporal dependencies present in system fault and use logs. However, the cluster analysis used for reducing spatio-temporal dependences should be fast and accurate to understand the underlying stochastic properties of these systems. K-means is a fast cluster analysis method, in which accuracy depends on the use of initialization algorithms that are usually serial and slow. In this paper we present two new parallel strategies for fast seeding K-means cluster analysis. Both strategies were tested on a real problem where the aim was to reduce spatial and temporal dependencies of failures on large supercomputer systems. The performance of both strategies were compared with five existing serial implementations: K-means implementations of 1) Lloyd (L); 2) McQueen (M); and 3) Hartigan - Wong (HW), all of them using Forgy seeding; 4) K-means++; and 5) Neural Gas clustering (NG), a more recent and sophisticated method. Our results show that our new Parallel Competitive Fitness approach reduces the Within Sum of Squares (WSQQ) measure, thus increasing cluster quality of the three K-means implementations: L; M; HW, and is 200 times faster than the existing serial K-means++. The existing serial and our new Parallel K-means++ have the lowest WSQQ. Our new Parallel K-means++ is twice as fast as the existing serial K-means++ method, and is 4 times faster than the NG method. Moreover, our new methods did not generate empty clusters, while NG did. As a result of our new techniques, predicting the amount of resources needed to provision VMs processing historic system fault and use data can now be done faster and with more accuracy.
Keywords
cloud computing; competitive algorithms; mainframes; parallel processing; pattern clustering; virtual machines; HW K-means implementation; Hartigan-Wong K-means implementations; L K-means implementation; Lloyd K-means implementations; M K-means implementation; McQueen K-means implementations; NG method; VMs processing historic system fault; WSQQ measurement; cloud computing systems; fast seeding K-means cluster analysis; forgy seeding; historic use data; initialization algorithms; large dataset analysis; large supercomputer systems; neural gas clustering; parallel K-means++; parallel competitive fitness approach; parallel strategies; spatial dependencies reduction; spatio-temporal dependencies; stochastic properties; system fault; virtual HPC cluster analysis; virtual machines; within sum of squares measurement; Accuracy; Algorithm design and analysis; Cloud computing; Clustering algorithms; Computer hacking; Conferences; Partitioning algorithms; Clustering; K-means; Neural Gas; VM resources prediction; parallel K-means++;
fLanguage
English
Publisher
ieee
Conference_Titel
Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference on
Conference_Location
Taipei
Print_ISBN
978-1-4673-4511-8
Electronic_ISBN
978-1-4673-4509-5
Type
conf
DOI
10.1109/CloudCom.2012.6427553
Filename
6427553
Link To Document