Title :
Identifying Complex Biological Interactions based on Categorical Gene Expression Data
Author :
Goertzel, Ben ; Pennachin, Cassio ; de Souza Coelho, Lucio Souza ; Mudado, Mauricio
Author_Institution :
Biomind LLC, Rockville
Abstract :
A novel method, MUTIC (model utilization-based clustering), is described for identifying complex interactions between genes or gene-categories based on gene expression data. The method deals with binary categorical data, consisting of a set of gene expression profiles divided into two biologically meaningful categories. It does not require data from multiple time points. Gene expression profiles are represented by feature vectors whose component features are either gene expression values or averaged expression values corresponding to gene ontology or protein information resource categories. A supervised learning algorithm (genetic programming) is used to learn an ensemble of classification models distinguishing the two categories based on the feature vectors corresponding to their members. Each feature is associated with a "model utilization vector," which has an entry for each high-quality classification model found, indicating whether or not the feature was used in that model. These utilization vectors are then clustered using a variant of hierarchical clustering called Omniclust. The result is a set of model-utilization-based clusters, in which features are grouped together if they are often used together by classification models, which may be because they are co-expressed, or may be for subtler reasons involving multi-gene interactions. The MUTIC method is illustrated by applying it to a dataset of gene expression in human brains of various ages. Compared to traditional expression-based clustering, MUTIC yields clusters that have higher mathematical quality (in the sense of homogeneity and separation) and that also yield novel insights into the underlying biological processes.
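The model utilization vectors described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy ensemble, the feature names, the function names, the Jaccard distance, and the stopping threshold are hypothetical, and plain average-linkage agglomerative clustering stands in for Omniclust, whose details the abstract does not give.

```python
from itertools import combinations

def utilization_vectors(models, features):
    """Map each feature to a 0/1 vector with one entry per model.

    Each model is represented (an assumption) only by the set of
    features it uses.
    """
    return {f: [1 if f in m else 0 for m in models] for f in features}

def jaccard_distance(u, v):
    """1 - |models using both| / |models using either|."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 0.0

def cluster(vectors, threshold):
    """Average-linkage agglomerative clustering (a stand-in, not Omniclust).

    Merging stops once the closest pair of clusters is farther apart
    than `threshold`.
    """
    clusters = [[f] for f in sorted(vectors)]

    def linkage(c1, c2):
        total = sum(jaccard_distance(vectors[a], vectors[b])
                    for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Hypothetical toy ensemble: four models, each listed by the features it uses.
models = [{"geneA", "geneB"}, {"geneA", "geneB", "geneC"},
          {"geneC", "geneD"}, {"geneD"}]
features = ["geneA", "geneB", "geneC", "geneD"]

vectors = utilization_vectors(models, features)
clusters = cluster(vectors, threshold=0.5)
print(clusters)
```

In this toy case geneA and geneB are used by exactly the same models, so they merge into one cluster regardless of their expression values, which is the sense in which the clusters are utilization-based rather than expression-based.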
Keywords :
biology computing; genetic algorithms; genetics; learning (artificial intelligence); pattern clustering; binary categorical data; biological process; categorical gene expression data; classification model; complex biological interactions; feature vector; gene expression profile; gene ontology; gene-categories; genetic programming; model utilization vector; model utilization-based clustering; protein information resource category; supervised learning algorithm; Biological interactions; Biological processes; Biological system modeling; Gene expression; Genetic programming; Humans; Information resources; Ontologies; Proteins; Supervised learning;
Conference_Title :
Evolutionary Computation, 2006. CEC 2006. IEEE Congress on
Conference_Location :
Vancouver, BC
Print_ISBN :
0-7803-9487-9
DOI :
10.1109/CEC.2006.1688477