DocumentCode :
651573
Title :
Mining Web Technical Discussions to Identify Malware Capabilities
Author :
Saxe, Joshua ; Mentis, David ; Greamo, Christopher
Author_Institution :
Invincea Inc., Fairfax, VA, USA
fYear :
2013
fDate :
8-11 July 2013
Firstpage :
1
Lastpage :
5
Abstract :
The exponential growth of unique malware binary artifacts has led researchers to explore automated techniques for characterizing unknown malware binaries' capabilities. Thus far, automatic malware analysis systems have relied on labeled training data and analyst-defined rules to identify malware samples' software features and functional categories. Such approaches require substantial expert analyst effort to maintain, as malware authors change programming languages, APIs, malicious tactics, and operating system targets. In this paper we present preliminary results demonstrating the viability of a new research direction for malware capability identification that addresses these issues: the concept of mining web technical documentation to automatically identify malware capabilities. This approach does not require expert generation of rules or training labels and automatically stays up to date with the latest software engineering trends. We make two contributions aimed at demonstrating the value of this research direction. First, with a corpus of 6 million web technical postings from the programming question and answer website StackOverflow.com, we show that symbols found in a corpus of malicious executable files, such as registry keys, file names, and API call names, also occur frequently in the StackOverflow data, suggesting that applying natural language processing to the StackOverflow posts (and other technical documents) may help us automatically generate characterizations of technical symbols, and thereby capabilities, found in malware. Our second contribution is to show that by analyzing function call symbol co-occurrence within StackOverflow posts, as well as the semantic tags associated with these posts, we can create function relationship graphs over the symbols which show promise in helping to identify malware software capabilities. We argue that these early findings demonstrate the promise of a web technical document based approach to automating malware capability identification.
Keywords :
Internet; Web sites; application program interfaces; data mining; invasive software; natural language processing; system documentation; API call names; APIs; StackOverflow data; StackOverflow posts; Web technical discussion mining; Web technical document based approach; Web technical documentation mining; answer Web site StackOverflow.com; automatic malware analysis systems; file names; function call symbol co-occurrence analysis; function relationship graphs; functional category; labeled training data; malicious tactics; malware capability identification; malware sample software feature identification; natural language processing; operating system targets; programming languages; programming question; registry keys; semantic tags; software engineering; training labels; unique malware binary artifacts; Clustering algorithms; Data mining; Internet; Malware; Programming; Webcams; computer security; data mining; machine learning; malware analysis; natural language processing; statistical modeling;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Distributed Computing Systems Workshops (ICDCSW), 2013 IEEE 33rd International Conference on
Conference_Location :
Philadelphia, PA
Print_ISBN :
978-1-4799-3247-4
Type :
conf
DOI :
10.1109/ICDCSW.2013.56
Filename :
6679853