• DocumentCode
    592843
  • Title

    Cluster analysis for the cloud: Parallel competitive fitness and parallel K-means++ for large dataset analysis

  • Author

    Esteves, R.M. ; Hacker, T. ; Chunming Rong

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Stavanger, Stavanger, Norway
  • fYear
    2012
  • fDate
    3-6 Dec. 2012
  • Firstpage
    177
  • Lastpage
    184
  • Abstract
    The amount of resources needed to provision Virtual Machines (VM) in a cloud computing systems to support virtual HPC clusters can be predicted from the analysis of historic use data. In previous work, Hacker et al. found that cluster analysis is a useful tool to understand the underlying spatio-temporal dependencies present in system fault and use logs. However, the cluster analysis used for reducing spatio-temporal dependences should be fast and accurate to understand the underlying stochastic properties of these systems. K-means is a fast cluster analysis method, in which accuracy depends on the use of initialization algorithms that are usually serial and slow. In this paper we present two new parallel strategies for fast seeding K-means cluster analysis. Both strategies were tested on a real problem where the aim was to reduce spatial and temporal dependencies of failures on large supercomputer systems. The performance of both strategies were compared with five existing serial implementations: K-means implementations of 1) Lloyd (L); 2) McQueen (M); and 3) Hartigan - Wong (HW), all of them using Forgy seeding; 4) K-means++; and 5) Neural Gas clustering (NG), a more recent and sophisticated method. Our results show that our new Parallel Competitive Fitness approach reduces the Within Sum of Squares (WSQQ) measure, thus increasing cluster quality of the three K-means implementations: L; M; HW, and is 200 times faster than the existing serial K-means++. The existing serial and our new Parallel K-means++ have the lowest WSQQ. Our new Parallel K-means++ is twice as fast as the existing serial K-means++ method, and is 4 times faster than the NG method. Moreover, our new methods did not generate empty clusters, while NG did. As a result of our new techniques, predicting the amount of resources needed to provision VMs processing historic system fault and use data can now be done faster and with more accuracy.
  • Keywords
    cloud computing; competitive algorithms; mainframes; parallel processing; pattern clustering; virtual machines; HW K-means implementation; Hartigan-Wong K-means implementations; L K-means implementation; Lloyd K-means implementations; M K-means implementation; McQueen K-means implementations; NG method; VMs processing historic system fault; WSQQ measurement; cloud computing systems; fast seeding K-means cluster analysis; forgy seeding; historic use data; initialization algorithms; large dataset analysis; large supercomputer systems; neural gas clustering; parallel K-means++; parallel competitive fitness approach; parallel strategies; spatial dependencies reduction; spatio-temporal dependencies; stochastic properties; system fault; virtual HPC cluster analysis; virtual machines; within sum of squares measurement; Accuracy; Algorithm design and analysis; Cloud computing; Clustering algorithms; Computer hacking; Conferences; Partitioning algorithms; Clustering; K-means; Neural Gas; VM resources prediction; parallel K-means++;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference on
  • Conference_Location
    Taipei
  • Print_ISBN
    978-1-4673-4511-8
  • Electronic_ISBN
    978-1-4673-4509-5
  • Type

    conf

  • DOI
    10.1109/CloudCom.2012.6427553
  • Filename
    6427553