• DocumentCode
    1484809
  • Title

    Efficient and Accurate Discovery of Patterns in Sequence Data Sets

  • Author

    Floratou, Avrilia ; Tata, Sandeep ; Patel, Jignesh M.

  • Author_Institution
    Comput. Sci. Dept., Univ. of Wisconsin-Madison, Madison, WI, USA
  • Volume
    23
  • Issue
    8
  • fYear
    2011
  • Firstpage
    1154
  • Lastpage
    1168
  • Abstract
    Existing sequence mining algorithms mostly focus on mining for subsequences. However, a large class of applications, such as biological DNA and protein motif mining, require efficient mining of “approximate” patterns that are contiguous. The few existing algorithms that can be applied to find such contiguous approximate patterns have drawbacks like poor scalability, lack of guarantees in finding the pattern, and difficulty in adapting to other applications. In this paper, we present a new algorithm called FLexible and Accurate Motif DEtector (FLAME). FLAME is a flexible suffix-tree-based algorithm that can be used to find frequent patterns with a variety of definitions of motif (pattern) models. It is also accurate, as it always finds the pattern if it exists. Using both real and synthetic data sets, we demonstrate that FLAME is fast, scalable, and outperforms existing algorithms on a variety of performance metrics. In addition, based on FLAME, we also address a more general problem, named extended structured motif extraction, which allows mining frequent combinations of motifs under relaxed constraints.
  • Keywords
    data mining; trees (mathematics); FLAME; flexible and accurate motif detector; sequence data sets; sequence mining algorithms; suffix-tree-based algorithm; Approximation algorithms; Biological system modeling; Computational modeling; DNA; Data mining; Fires; Proteins; Motif; sequence mining; suffix tree
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2011.69
  • Filename
    5740881