Title :
A generalized bio-inspired method for discovering sequence-based signatures
Author :
Peterson, Eric ; Curtis, D. ; Phillips, Andrew ; Teuton, Jeremy ; Oehmen, Christopher
Author_Institution :
Pacific Northwest Nat. Lab., Richland, WA, USA
Abstract :
Many phenomena that we wish to discover are composed of sequences of events or event primitives. Often signatures are constructed to identify such phenomena using either distributions or frequencies of attributes, or specific subsequences that are known to correlate with the phenomena. Distribution-based identification does not capture the essence of the sequence of behaviors and therefore may suffer from a lack of specificity. At the other extreme, using specific subsequences to identify target phenomena is often too specific and suffers from lower sensitivity when natural variations arise in the phenomena, measurement process, or data analysis. We introduce here a method for discovering signatures for phenomena that are well characterized by sequences of event primitives. In this paper, we describe the steps taken and lessons learned in generalizing a sequence analysis method, BLAST, for use on non-biological datasets, including expressing and operating on alphabets of varying length, constructing a reward/penalty model for arbitrary datasets, and discovering low-complexity segments in sequence data by extending BLAST's native low-complexity estimating algorithms. We also present high-level overviews of several case studies that demonstrate the utility of this method for discovering signatures in a wide array of applications including network traffic, software analysis, server characterization, and others. Finally, we demonstrate how signatures discovered using this method can be expressed using a variety of model formalisms, each having its own relative benefits.
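As a rough illustration of what a reward/penalty model over an arbitrary alphabet can look like, the sketch below scores the best local alignment between two sequences of event tokens using the classic Smith-Waterman recurrence. It is not the authors' BLAST-based implementation: the function name `local_alignment_score`, the linear gap cost, and the default values for `reward`, `penalty`, and `gap` are all assumptions chosen for illustration.

```python
from typing import Hashable, Sequence


def local_alignment_score(
    a: Sequence[Hashable],
    b: Sequence[Hashable],
    reward: int = 2,    # assumed score for a matching pair of events
    penalty: int = -1,  # assumed score for a mismatching pair of events
    gap: int = -2,      # assumed linear cost for skipping one event
) -> int:
    """Return the best Smith-Waterman local alignment score of a and b.

    The alphabet is arbitrary: any hashable token (a string, an int,
    a tuple) can stand for an event primitive.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    for x in a:
        curr = [0]  # column 0 is always 0 in local alignment
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (reward if x == y else penalty)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Because scores are clamped at zero and gaps and mismatches are only penalized rather than forbidden, a candidate signature can still score well against an observed event stream that contains small variations, which is the tolerance that exact subsequence matching lacks.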
Keywords :
bioinformatics; computational complexity; information filtering; text analysis; BLAST sequence analysis method; alphabet length; arbitrary datasets; attribute frequencies; distribution-based identification; event primitive sequences; generalized bio-inspired method; low-complexity estimation algorithms; low-complexity segments; nonbiological datasets; penalty model; reward model; sequence-based signature discovery; target phenomena identification; Bioinformatics; Cloning; Complexity theory; Laboratories; Matrices; Servers; Software; alignment; cybersecurity; sequence analysis; signature;
Conference_Title :
Intelligence and Security Informatics (ISI), 2013 IEEE International Conference on
Conference_Location :
Seattle, WA
Print_ISBN :
978-1-4673-6214-6
DOI :
10.1109/ISI.2013.6578853