DocumentCode
589281
Title
Binary Function Clustering Using Semantic Hashes
Author
Jin, Weiwei ; Chaki, Sagar ; Cohen, C. ; Gurfinkel, Arie ; Havrilla, J. ; Hines, C. ; Narasimhan, Priya
Author_Institution
Carnegie Mellon Univ., Pittsburgh, PA, USA
Volume
1
fYear
2012
fDate
12-15 Dec. 2012
Firstpage
386
Lastpage
391
Abstract
The ability to identify semantically-related functions in large collections of binary executables is important for malware detection. Intuitively, two pieces of code are similar if they have the same effect on a machine's state. Current state-of-the-art tools employ a variety of pairwise comparisons (e.g., template matching using SMT solvers, Value-Set analysis at critical program points, API call matching, etc.). However, these methods are unscalable for clustering large datasets of size N, since they require O(N^2) comparisons. In this paper, we present an alternative approach based upon "hashing". We propose a scheme that captures the semantics of functions as semantic hashes. Our approach treats a function as a set of features, each of which represents the input-output behavior of a basic block. Using a form of locality-sensitive hashing known as MinHashing, functions with many common features can be quickly identified, and the complexity of clustering is reduced to O(N). Experiments on functions extracted from the CERT malware catalog indicate that we are able to cluster closely related code with a low false positive rate.
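The MinHashing idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature strings stand in for hashes of basic-block input-output behavior, and the helper names (`minhash_signature`, `estimated_jaccard`, `lsh_buckets`), the choice of BLAKE2b as the hash family, and the parameters `num_hashes` and `bands` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def _h(seed: int, feature: str) -> int:
    """Seeded 64-bit hash; each seed acts as one member of a hash family (assumed choice)."""
    data = f"{seed}:{feature}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(features, num_hashes=64):
    """Return the MinHash signature of a feature set.

    `features` is a hypothetical stand-in for the per-basic-block
    semantic features of one function.
    """
    features = set(features)
    if not features:
        raise ValueError("a function needs at least one feature")
    return tuple(min(_h(seed, f) for f in features) for seed in range(num_hashes))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def lsh_buckets(signatures, bands=16):
    """Group function names whose signatures agree on at least one whole band.

    Each function is hashed into `bands` buckets in a single pass, so
    candidate groups are found without comparing every pair of functions.
    """
    if not signatures:
        return []
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets[key].add(name)
    return [names for names in buckets.values() if len(names) > 1]


if __name__ == "__main__":
    # Hypothetical feature sets: f1 and f2 share most basic-block features.
    funcs = {
        "f1": {"bb_a", "bb_b", "bb_c", "bb_d"},
        "f2": {"bb_a", "bb_b", "bb_c", "bb_e"},
        "f3": {"bb_x", "bb_y", "bb_z"},
    }
    sigs = {name: minhash_signature(feats) for name, feats in funcs.items()}
    print(estimated_jaccard(sigs["f1"], sigs["f2"]))
    print(lsh_buckets(sigs))
```

Each function is processed once to compute its signature and bucket keys, which is where the linear cost in the number of functions comes from; functions that land in a common bucket are the only ones that would need a closer comparison.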
Keywords
computational complexity; file organisation; invasive software; pattern clustering; program diagnostics; CERT malware catalog; MinHashing; binary executables; binary function clustering; clustering complexity reduction; code clustering; input-output behavior; large datasets clustering; locality-sensitive hashing; low false positive rate; malware detection; pairwise comparisons; semantic hash; semantically-related function identification; Benchmark testing; Catalogs; Concrete; Feature extraction; Malware; Registers; Semantics; binary static analysis; clustering; malware detection; reverse engineering; semantic comparison;
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Applications (ICMLA), 2012 11th International Conference on
Conference_Location
Boca Raton, FL
Print_ISBN
978-1-4673-4651-1
Type
conf
DOI
10.1109/ICMLA.2012.70
Filename
6406693