DocumentCode
50074
Title
Incremental Fuzzy Clustering With Multiple Medoids for Large Data
Author
Yangtao Wang ; Lihui Chen ; Jian-Ping Mei
Author_Institution
Div. of Inf. Eng., Nanyang Technol. Univ., Singapore, Singapore
Volume
22
Issue
6
fYear
2014
fDate
Dec. 2014
Firstpage
1557
Lastpage
1568
Abstract
As a core data analysis technique, clustering plays an important role in finding the underlying pattern structure embedded in unlabeled data. Clustering algorithms that need to hold all the data in memory for analysis become infeasible when the dataset is too large to be stored. To handle such large data, incremental clustering approaches have been proposed. The key idea behind these approaches is to find representatives (centroids or medoids) for each cluster in each data chunk, where a chunk is a packet of the data, and to carry out the final data analysis on the representatives identified from all the chunks. In this paper, we propose a new incremental clustering approach called incremental multiple medoids-based fuzzy clustering (IMMFC) to handle complex patterns that are not compact and well separated. We investigate whether IMMFC is a good alternative for capturing the underlying data structure more accurately. IMMFC not only facilitates the selection of multiple medoids for each cluster in a data chunk, but also provides a mechanism that uses the relationships among those identified medoids as side information to help the final data clustering process. The detailed problem formulation, the derivation of the updating rules, and an in-depth analysis of the proposed IMMFC are provided. Experimental studies have been conducted on several large datasets, including real-world malware datasets. IMMFC outperforms existing incremental fuzzy clustering approaches in terms of clustering accuracy and robustness to the order of data. These results demonstrate the great potential of IMMFC for large-data analysis.
Keywords
data analysis; data structures; fuzzy set theory; pattern clustering; IMMFC; centroids; complex pattern handling; data structure; incremental clustering approaches; incremental fuzzy clustering; incremental multiple medoids-based fuzzy clustering; malware datasets; pattern structure; rules derivation; Algorithm design and analysis; Clustering algorithms; Data analysis; Linear programming; TV; Time complexity; Vectors; Fuzzy clustering; incremental clustering; large data; malware clustering; multiple medoids
fLanguage
English
Journal_Title
Fuzzy Systems, IEEE Transactions on
Publisher
IEEE
ISSN
1063-6706
Type
jour
DOI
10.1109/TFUZZ.2014.2298244
Filename
6704313