Title : 
Using Mahout for Clustering Wikipedia's Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud
         
        
            Author : 
Esteves, Rui Máximo ; Rong, Chunming
         
        
            Author_Institution : 
Dept. of Electr. & Comput. Eng., Univ. of Stavanger, Stavanger, Norway
         
        
        
            fDate : 
Nov. 29 2011-Dec. 1 2011
         
        
        
        
            Abstract : 
This paper compares k-means and fuzzy c-means for clustering a large, noisy, realistic dataset. We made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past, the use of these two algorithms was restricted to small datasets. As a result, studies were based on artificial datasets that do not represent a real document clustering situation. In this ongoing research we found that, on a noisy dataset, fuzzy c-means can lead to worse cluster quality than k-means. We also found that k-means does not always converge faster than fuzzy c-means. Mahout is a promising clustering technology, but its preprocessing tools are not developed enough for efficient dimensionality reduction. From our experience, the use of Apache Mahout is premature.
         
        
            Keywords : 
Web sites; cloud computing; document handling; fuzzy set theory; pattern clustering; Apache Mahout; Hadoop; Wikipedia latest article clustering; artificial datasets; cluster quality; free cloud computing solution; fuzzy c-means clustering; k-means clustering; noisy realistic dataset; real document clustering; Clustering algorithms; Convergence; Electronic publishing; Encyclopedias; Internet; Vectors; Mahout; document clustering; fuzzy c-means; k-means
         
        
        
        
            Conference_Titel : 
Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on
         
        
            Conference_Location : 
Athens
         
        
            Print_ISBN : 
978-1-4673-0090-2
         
        
        
            DOI : 
10.1109/CloudCom.2011.86