Combining sequence and time series expression data to learn transcriptional modules

Author

Kundaje, Anshul ; Middendorf, Manuel ; Gao, Feng ; Wiggins, Chris ; Leslie, Christina

Author_Institution

Dept. of Comput. Sci., Columbia Univ., New York, NY, USA

Volume

Issue

Year

2005

Firstpage

194

Lastpage

202

Abstract

Our goal is to cluster genes into transcriptional modules, that is, sets of genes whose similarity in expression is explained by common regulatory mechanisms at the transcriptional level. We want to learn modules from both time series gene expression data and genome-wide motif data, which are now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules. Starting with a set of motifs representing known or putative regulatory elements (transcription factor binding sites) and the counts of occurrences of these motifs in each gene's promoter region, together with a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. We also present a technique based on the Jensen-Shannon entropy contributions of motifs in the learned model for associating the most significant motifs with each module. Thus, the algorithm gives a global approach for associating sets of regulatory elements with "modules" of genes with similar time series expression profiles. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course data sets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated with known cell cycle transcription factors.
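
The abstract does not spell out how the Jensen-Shannon entropy contributions of motifs are computed, so the following is only an illustrative sketch and not the authors' implementation. It assumes the standard generalized Jensen-Shannon divergence, H(sum_k w_k p_k) - sum_k w_k H(p_k), taken over per-module motif distributions, and splits it into one term per motif. The function name `js_contributions` and its arguments are hypothetical.

```python
import math


def js_contributions(dists, weights):
    """Per-motif contributions to the generalized Jensen-Shannon divergence.

    dists:   list of K probability vectors over the same M motifs
             (assumed here to be one motif distribution per module).
    weights: list of K non-negative weights summing to 1.
    Returns a list of M values whose sum is the JS divergence in bits.
    """
    n_motifs = len(dists[0])
    contributions = []
    for m in range(n_motifs):
        # Probability of motif m under the weighted mixture of all modules.
        mix = sum(w * p[m] for w, p in zip(weights, dists))
        # Mixture entropy term for motif m (0 * log 0 is treated as 0).
        term = -mix * math.log2(mix) if mix > 0 else 0.0
        # Subtract the weighted entropy terms of the individual modules.
        for w, p in zip(weights, dists):
            if p[m] > 0:
                term += w * p[m] * math.log2(p[m])
        contributions.append(term)
    return contributions


if __name__ == "__main__":
    # Toy example: two modules, three motifs (made-up numbers).
    modules = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
    for motif, value in enumerate(js_contributions(modules, [0.5, 0.5])):
        print(f"motif {motif}: {value:.4f} bits")
```

Under this assumed decomposition, motifs whose frequencies differ most across modules receive the largest contributions, which is one plausible way to rank motifs when associating them with modules.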

Keywords

cellular biophysics; entropy; genetics; learning (artificial intelligence); microorganisms; molecular biophysics; molecular configurations; physiological models; probability; statistical analysis; time series; Jensen-Shannon entropy; cell cycle transcription factors; expectation maximization; gene sequence; generative probabilistic model; learning algorithm; motif clustering algorithm; statistical splines; time series gene expression; transcription factor binding sites; transcriptional modules; yeast cell cycle; Bioinformatics; Clustering algorithms; Computational biology; Data analysis; Entropy; Fungi; Gene expression; Genomics; Organisms; Sequences; Gene regulation; clustering; heterogeneous data; Algorithms; Artificial Intelligence; Computer Simulation; Gene Expression Profiling; Gene Expression Regulation; Models, Genetic; Models, Statistical; Multigene Family; Oligonucleotide Array Sequence Analysis; Pattern Recognition, Automated; Sequence Analysis, DNA; Time Factors; Transcription Factors

Language

English

Journal_Title

Computational Biology and Bioinformatics, IEEE/ACM Transactions on

Publisher

IEEE

ISSN

1545-5963

Type

jour

DOI

10.1109/TCBB.2005.34

Filename

1504684

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=1157590