• DocumentCode
    1157590
  • Title
    Combining sequence and time series expression data to learn transcriptional modules
  • Author
    Kundaje, Anshul ; Middendorf, Manuel ; Gao, Feng ; Wiggins, Chris ; Leslie, Christina
  • Author_Institution
    Dept. of Comput. Sci., Columbia Univ., New York, NY, USA
  • Volume
    2
  • Issue
    3
  • fYear
    2005
  • Firstpage
    194
  • Lastpage
    202
  • Abstract
    Our goal is to cluster genes into transcriptional modules - sets of genes where similarity in expression is explained by common regulatory mechanisms at the transcriptional level. We want to learn modules from both time series gene expression data and genome-wide motif data that are now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules. Starting with a set of motifs representing known or putative regulatory elements (transcription factor binding sites) and the counts of occurrences of these motifs in each gene's promoter region, together with a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. We also present a technique based on the Jensen-Shannon entropy contributions of motifs in the learned model for associating the most significant motifs with each module. Thus, the algorithm gives a global approach for associating sets of regulatory elements with "modules" of genes with similar time series expression profiles. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course data sets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated with known cell cycle transcription factors.
  • Keywords
    cellular biophysics; entropy; genetics; learning (artificial intelligence); microorganisms; molecular biophysics; molecular configurations; physiological models; probability; statistical analysis; time series; Jensen-Shannon entropy; cell cycle transcription factors; expectation maximization; gene sequence; generative probabilistic model; learning algorithm; motif clustering algorithm; statistical splines; time series gene expression; transcription factor binding sites; transcriptional modules; yeast cell cycle; Bioinformatics; Clustering algorithms; Computational biology; Data analysis; Entropy; Fungi; Gene expression; Genomics; Organisms; Sequences; Gene regulation; clustering; heterogeneous data; Algorithms; Artificial Intelligence; Computer Simulation; Gene Expression Profiling; Gene Expression Regulation; Models, Genetic; Models, Statistical; Multigene Family; Oligonucleotide Array Sequence Analysis; Pattern Recognition, Automated; Sequence Analysis, DNA; Time Factors; Transcription Factors;
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type
    jour
  • DOI
    10.1109/TCBB.2005.34
  • Filename
    1504684