• DocumentCode
    140743
  • Title

    Finding common ground among experts' opinions on data clustering: With applications in malware analysis

  • Author

    Guanhua Yan

  • Author_Institution
    Inf. Sci. Group (CCS-3), Los Alamos Nat. Lab., Los Alamos, NM, USA
  • fYear
    2014
  • fDate
    March 31 2014-April 4 2014
  • Firstpage
    15
  • Lastpage
    27
  • Abstract
    Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows significantly, data clustering becomes computationally prohibitive and resource demanding, and sometimes it is necessary to outsource these tasks to third party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased due to the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering, as we do not require all data objects be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects - if they are selected - are grouped together. We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major AV (Anti-Virus) software classify these samples. Our work offers a new direction for consensus clustering by striking a balance between the clustering quality and the amount of data objects chosen to be clustered.
  • Keywords
    computational complexity; computer viruses; data mining; graph theory; pattern clustering; 3-uniform hypergraph; AV software; NP-complete; antivirus software; clustering quality; common ground; consensus clustering; data clustering; data objects; expert opinions; knowledge discovery; malware analysis; malware clusters; Feature extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2014 IEEE 30th International Conference on
  • Conference_Location
    Chicago, IL
  • Type

    conf

  • DOI
    10.1109/ICDE.2014.6816636
  • Filename
    6816636