DocumentCode :
140743
Title :
Finding common ground among experts' opinions on data clustering: With applications in malware analysis
Author :
Guanhua Yan
Author_Institution :
Inf. Sci. Group (CCS-3), Los Alamos Nat. Lab., Los Alamos, NM, USA
fYear :
2014
fDate :
March 31 2014-April 4 2014
Firstpage :
15
Lastpage :
27
Abstract :
Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows significantly, data clustering becomes computationally prohibitive and resource demanding, and sometimes it is necessary to outsource these tasks to third-party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased due to the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering, as we do not require that all data objects be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects, if they are selected, are grouped together. We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major anti-virus (AV) software products classify these samples. Our work offers a new direction for consensus clustering by striking a balance between the clustering quality and the number of data objects chosen to be clustered.
Keywords :
computational complexity; computer viruses; data mining; graph theory; pattern clustering; 3-uniform hypergraph; AV software; NP-complete; antivirus software; clustering quality; common ground; consensus clustering; data clustering; data objects; expert opinions; knowledge discovery; malware analysis; malware clusters; Feature extraction
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering (ICDE), 2014 IEEE 30th International Conference on
Conference_Location :
Chicago, IL
Type :
conf
DOI :
10.1109/ICDE.2014.6816636
Filename :
6816636