• DocumentCode
    1974647
  • Title
    ARIGUMA Code Analyzer: Efficient Variant Detection by Identifying Common Instruction Sequences in Malware Families
  • Author
    Yang Zhong ; Yamaki, Hirofumi ; Yamaguchi, Yoshio ; Takakura, Hiroki
  • Author_Institution
    Grad. Sch. of Inf. Sci., Nagoya Univ., Nagoya, Japan
  • fYear
    2013
  • fDate
    22-26 July 2013
  • Firstpage
    11
  • Lastpage
    20
  • Abstract
    The first step of malware analysis is to determine whether a given malware program is a variant of a known one. If it is clearly not a variant, it must be analyzed manually. However, manual analysis is very costly and cannot be performed on all of the enormous number of newly found malware programs. An automatic and accurate malware classification method would help address this situation. Existing methods suffer from problems such as the cost of calculating the similarity between every pair of malware programs in a database and the inability to present precisely the similarities and differences between programs. In our approach, known malware programs are classified into families. A given malware program is determined to be a variant if it is classified into an existing family. Incremental clustering is then performed on the new program and the family, which reduces the cost of retraining and similarity calculation. Accurate comparison between programs is enabled by evaluating their differences using the longest common subsequences (LCSs) of instructions. To reduce the amount of costly LCS calculation, numeric features of the code, such as cyclomatic complexity and the number of function calls, are used to filter out dissimilar code. Subsequences in the LCS of two pieces of code are presented to malware analysts as the similarity between them, while those outside it are presented as the difference. Experimental results show that this method can detect the names of APIs used in malware, which existing methods cannot; that it is useful for identifying inserted code used to generate variants that evade pattern detection by anti-virus software; and that it reduces the time needed to process malware programs without degrading classification accuracy.
  • Keywords
    application program interfaces; invasive software; pattern classification; pattern clustering; API; ARIGUMA code analyzer; LCS; common instruction sequence identification; cyclomatic complexity; function calls; incremental clustering; longest common subsequences; malware analysis; malware families; malware program classification; variant detection; Clustering algorithms; Databases; Feature extraction; Malware; Manuals; Training; Vectors; LCS; incremental clustering; malware classification; static analysis;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Software and Applications Conference (COMPSAC), 2013 IEEE 37th Annual
  • Conference_Location
    Kyoto
  • Type
    conf
  • DOI
    10.1109/COMPSAC.2013.6
  • Filename
    6649793
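
The comparison step described in the abstract (a cheap numeric filter followed by an LCS over instruction sequences) might be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the feature tuple (cyclomatic complexity, number of calls), the 0.5 tolerance, and the sample instruction strings are all hypothetical, and a textbook dynamic-programming LCS stands in for whatever LCS routine the paper actually uses.

```python
from typing import List, Optional, Sequence, Tuple


def features_close(f1: Sequence[float], f2: Sequence[float],
                   tolerance: float = 0.5) -> bool:
    """Hypothetical pre-filter: every numeric feature must be within a
    relative tolerance, otherwise the costly LCS step is skipped."""
    return all(abs(x - y) <= tolerance * max(x, y, 1) for x, y in zip(f1, f2))


def lcs_diff(a: Sequence[str], b: Sequence[str]
             ) -> Tuple[List[str], List[str], List[str]]:
    """Return (common, only_in_a, only_in_b) using a standard DP-based LCS."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table forward to split instructions into shared and unshared.
    common: List[str] = []
    only_a: List[str] = []
    only_b: List[str] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            only_a.append(a[i])
            i += 1
        else:
            only_b.append(b[j])
            j += 1
    only_a.extend(a[i:])
    only_b.extend(b[j:])
    return common, only_a, only_b


def compare(code_a: Sequence[str], feat_a: Sequence[float],
            code_b: Sequence[str], feat_b: Sequence[float]
            ) -> Optional[Tuple[List[str], List[str], List[str]]]:
    """Run the LCS only when the numeric features pass the filter."""
    if not features_close(feat_a, feat_b):
        return None  # filtered out as dissimilar; no LCS computed
    return lcs_diff(code_a, code_b)


# Made-up instruction sequences: the variant has two inserted no-ops.
known = ["push ebp", "mov ebp, esp", "call CreateFileA", "ret"]
variant = ["push ebp", "mov ebp, esp", "nop", "nop", "call CreateFileA", "ret"]

result = compare(known, (3, 5), variant, (3, 6))
filtered = compare(known, (3, 5), variant, (40, 90))
```

In this sketch the shared instructions play the role of the "similarity" shown to the analyst and the leftover instructions play the role of the "difference", so inserted code such as the two no-ops appears in the third list. The quadratic cost of the table is the reason a filter is placed in front of it.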