• DocumentCode
    75995
  • Title
    Toward Unsupervised Protocol Feature Word Extraction
  • Author
    Zhuo Zhang ; Zhibin Zhang ; Lee, Patrick P. C. ; Yunjie Liu ; Gaogang Xie
  • Author_Institution
    Inst. of Comput. Technol., Beijing, China
  • Volume
    32
  • Issue
    10
  • fYear
    2014
  • fDate
    Oct. 2014
  • Firstpage
    1894
  • Lastpage
    1906
  • Abstract
    Protocol feature words are byte subsequences within traffic payload that can distinguish application protocols, and they form the building blocks of many constructions of deep packet analysis rules in network management, measurement, and security systems. However, how to systematically and efficiently extract protocol feature words from network traffic remains a challenging issue. Existing approaches, such as those based on n-gram or Common String (CS), simply break payload into equal-length pieces or attempt to find a frequent itemset, and are ineffective in capturing the hidden statistical structure of the payload content. In this paper, we propose ProWord, an unsupervised approach that extracts protocol feature words from traffic traces. ProWord builds on two nontrivial algorithms. First, we propose an unsupervised segmentation algorithm based on the modified Voting Experts algorithm, such that we break payload into candidate words according to entropy information and provide more accurate segmentation than existing n-gram and CS approaches. Second, we propose a ranking algorithm that incorporates different types of well-known feature word retrieval heuristics, such that we can build an ordered structure on the candidate words and select the highest ranked ones as protocol feature words. We compare ProWord with existing approaches via evaluation on real-world traffic traces. We show that ProWord captures true protocol feature words more accurately and performs significantly faster.
  • Keywords
    Internet; computer network management; protocols; unsupervised learning; ProWord approach; application protocols; common string; deep packet analysis rules; entropy information; feature word retrieval heuristics; n-gram; network management system; network measurement system; network security system; ranking algorithm; traffic payload; unsupervised protocol feature word extraction; unsupervised segmentation algorithm; voting experts algorithm; Algorithm design and analysis; Entropy; Feature extraction; Partitioning algorithms; Payloads; Protocols; Redundancy; Network traffic analysis; network traffic identification; protocol reverse engineering; unsupervised information extraction
  • fLanguage
    English
  • Journal_Title
    Selected Areas in Communications, IEEE Journal on
  • Publisher
    IEEE
  • ISSN
    0733-8716
  • Type
    jour
  • DOI
    10.1109/JSAC.2014.2358857
  • Filename
    6902777