• DocumentCode
    589281
  • Title

    Binary Function Clustering Using Semantic Hashes

  • Author

    Jin, Weiwei ; Chaki, Sagar ; Cohen, C. ; Gurfinkel, Arie ; Havrilla, J. ; Hines, C. ; Narasimhan, Priya

  • Author_Institution
    Carnegie Mellon Univ., Pittsburgh, PA, USA
  • Volume
    1
  • fYear
    2012
  • fDate
    12-15 Dec. 2012
  • Firstpage
    386
  • Lastpage
    391
  • Abstract
    The ability to identify semantically related functions in large collections of binary executables is important for malware detection. Intuitively, two pieces of code are similar if they have the same effect on a machine's state. Current state-of-the-art tools employ a variety of pairwise comparisons (e.g., template matching using SMT solvers, value-set analysis at critical program points, API call matching, etc.). However, these methods are unscalable for clustering large datasets of size N, since they require O(N^2) comparisons. In this paper, we present an alternative approach based upon "hashing". We propose a scheme that captures the semantics of functions as semantic hashes. Our approach treats a function as a set of features, each of which represents the input-output behavior of a basic block. Using a form of locality-sensitive hashing known as MinHashing, functions with many common features can be quickly identified, and the complexity of clustering is reduced to O(N). Experiments on functions extracted from the CERT malware catalog indicate that we are able to cluster closely related code with a low false positive rate.
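    To illustrate the idea in the abstract, the following is a minimal sketch of MinHash-based bucketing. It is not the authors' implementation. The feature strings are hypothetical stand-ins for basic-block input-output behavior, and the choice of SHA-1, 64 hash functions, and a band size of 4 are assumptions made only for this example. Each function is hashed once and placed into buckets, so no pairwise comparison is performed.

```python
import hashlib
from collections import defaultdict


def _hash(seed: int, feature: str) -> int:
    """Seeded 64-bit hash of a feature string (SHA-1 is an arbitrary choice)."""
    data = f"{seed}:{feature}".encode()
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(features, num_hashes=64):
    """MinHash signature: the minimum hash value per seeded hash function.

    `features` must be a non-empty collection of strings.
    """
    return tuple(
        min(_hash(seed, f) for f in features) for seed in range(num_hashes)
    )


def cluster_by_bands(functions, num_hashes=64, band_size=4):
    """Group function names whose signatures agree on at least one band.

    `functions` maps a function name to its (hypothetical) feature set.
    Each function is processed once, so the work is linear in the number
    of functions for fixed num_hashes and band_size.
    """
    buckets = defaultdict(set)
    for name, features in functions.items():
        sig = minhash_signature(features, num_hashes)
        for start in range(0, num_hashes, band_size):
            band = sig[start:start + band_size]
            buckets[(start, band)].add(name)
    # Only buckets holding more than one function suggest a relation.
    return [names for names in buckets.values() if len(names) > 1]


# Hypothetical feature sets: "f" and "g" share all features, "h" shares none.
functions = {
    "f": {"blk:a", "blk:b", "blk:c"},
    "g": {"blk:a", "blk:b", "blk:c"},
    "h": {"blk:x", "blk:y", "blk:z"},
}
clusters = cluster_by_bands(functions)
```

    In this sketch, smaller bands make partially overlapping functions more likely to collide, while larger bands reduce false positives; the right trade-off depends on the dataset.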
  • Keywords
    computational complexity; file organisation; invasive software; pattern clustering; program diagnostics; CERT malware catalog; MinHashing; binary executables; binary function clustering; clustering complexity reduction; code clustering; input-output behavior; large datasets clustering; locality-sensitive hashing; low false positive rate; malware detection; pairwise comparisons; semantic hash; semantically-related function identification; Benchmark testing; Catalogs; Concrete; Feature extraction; Malware; Registers; Semantics; binary static analysis; clustering; malware detection; reverse engineering; semantic comparison;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Applications (ICMLA), 2012 11th International Conference on
  • Conference_Location
    Boca Raton, FL
  • Print_ISBN
    978-1-4673-4651-1
  • Type

    conf

  • DOI
    10.1109/ICMLA.2012.70
  • Filename
    6406693