• DocumentCode
    3557910
  • Title

    A Monte Carlo EM Algorithm for De Novo Motif Discovery in Biomolecular Sequences

  • Author

    Bi, Chengpeng

  • Author_Institution
    Div. of Clinical Pharmacology, Children´´s Mercy Hosp. & Clinics, Kansas City, MO, USA
  • Volume
    6
  • Issue
    3
  • fYear
    2009
  • Firstpage
    370
  • Lastpage
    386
  • Abstract
    Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular methods used in de novo motif discovery. Based on the position weight matrix (PWM) updating technique, this paper presents a Monte Carlo version of the EM motif-finding algorithm that carries out stochastic sampling in local alignment space to overcome the conventional EM´s main drawback of being trapped in a local optimum. The newly implemented algorithm is named as Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model, and then it iteratively performs Monte Carlo simulation and parameter update until convergence. A log-likelihood profiling technique together with the top-k strategy is introduced to cope with the phase shifts and multiple modal issues in motif discovery problem. A novel grouping motif alignment (GMA) algorithm is designed to select motifs by clustering a population of candidate local alignments and successfully applied to subtle motif discovery. MCEMDA compares favorably to other popular PWM-based and word enumerative motif algorithms tested using simulated (l, d)-motif cases, documented prokaryotic, and eukaryotic DNA motif sequences. Finally, MCEMDA is applied to detect large blocks of conserved domains using protein benchmarks and exhibits its excellent capacity while compared with other multiple sequence alignment methods.
  • Keywords
    Monte Carlo methods; expectation-maximisation algorithm; genetics; molecular biophysics; proteins; De Novo motif discovery; Expectation Maximization; Monte Carlo EM algorithm; biomolecular sequences; genetic regulatory codes; position weight matrix; protein sequences; Biology and genetics; Expectation maximization (EM); Monte Carlo EM; Statistical computing; Stochastic processes; motif discovery; multiple sequence alignment; transcriptional regulation.; Algorithms; Amino Acid Motifs; Amino Acid Sequence; Animals; Base Sequence; Computer Simulation; DNA; Databases, Genetic; Markov Chains; Models, Molecular; Molecular Sequence Data; Monte Carlo Method; Nucleic Acid Conformation; Protein Structure, Secondary; Proteins; Sequence Analysis; Transcription Factors;
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    ieee
  • Conference_Location
    10/10/2008 12:00:00 AM
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2008.103
  • Filename
    4641908