• DocumentCode
    2075058
  • Title

    sBGMM: A Stratified Beta-Gaussian Mixture Model for Clustering Genes with Multiple Data Sources

  • Author

    Dai, Xiaofeng ; Lahdesmaki, Harri ; Yli-Harja, Olli

  • Author_Institution
    Dept. of Signal Process., Tampere Univ. of Technol., Tampere
  • fYear
    2008
  • fDate
    June 29 2008-July 5 2008
  • Firstpage
    94
  • Lastpage
    99
  • Abstract
    Cluster analysis is widely applied to discover the function of previously unannotated genes. This paper presents a novel stratified beta-Gaussian mixture model, sBGMM, for clustering genes based on gene expression data, protein-DNA binding data, and data that can inform the construction of priors, such as protein-protein interaction (PPI) data. An expectation-maximization (EM) type algorithm for the Beta mixture model is first developed and then combined with that of the Gaussian mixture model. The combined algorithm jointly estimates the parameters of both the Beta and Gaussian distributions and forms the core of the sBGMM method. The stratification property of sBGMM takes the form of stratum-specific prior probabilities, which in this study are constructed from pre-clustering results obtained from PPI data. The proposed sBGMM method differs from other mixture-model-based methods in that it integrates two different data types into a single, unified probabilistic modeling framework and incorporates prior information from a third data source. Several well-studied model selection methods, such as the Akaike information criterion (AIC), modified AIC (AIC3), Bayesian information criterion (BIC), and integrated classification likelihood-BIC (ICL-BIC), are applied to estimate the number of clusters, and simulation results show that AIC3 works best for sBGMM. Simulations also indicate that combining two different data sources into a single mixture model can greatly improve clustering accuracy and stability, and that employing priors to stratify the model can further enhance its performance. The proposed method makes more efficient use of multiple data sources than methods that analyze different data sources separately.
  • Keywords
    Gaussian distribution; biology computing; expectation-maximisation algorithm; genetic engineering; molecular biophysics; parameter estimation; probability; proteins; statistical analysis; AIC3; Akaike information criterion; Bayesian information criterion; Beta distribution parameter estimation; Gaussian distribution parameter estimation; ICL-BIC; Stratum specific prior probabilities; cluster analysis; cluster number estimation; expectation maximization type algorithm; gene clustering; gene expression data; gene function discovery; integrated classification likelihood-BIC; modified AIC; multiple data sources; probabilistic modeling framework; protein-DNA binding data; protein-protein interaction data; sBGMM stratification property; stratified beta Gaussian mixture model; Bayesian methods; Bioinformatics; Biological system modeling; Biomedical signal processing; Clustering algorithms; Power system modeling; Proteins; Signal analysis; Signal processing algorithms; Stability; BGMM (Beta-Gaussian mixture model); BMM (Beta mixture model); EM (Expectation maximization); GMM (Gaussian mixture model); sBGMM (stratified Beta-Gaussian mixture model)
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Biocomputation, Bioinformatics, and Biomedical Technologies, 2008. BIOTECHNO '08. International Conference on
  • Conference_Location
    Bucharest
  • Print_ISBN
    978-0-7695-3191-5
  • Electronic_ISBN
    978-0-7695-3191-5
  • Type

    conf

  • DOI
    10.1109/BIOTECHNO.2008.12
  • Filename
    4561141
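
The abstract names four model selection criteria (AIC, AIC3, BIC, ICL-BIC) for estimating the number of clusters. The sketch below shows how these are commonly computed from a fitted mixture model, using their standard textbook definitions; it is not taken from the paper. The function name `information_criteria` and its arguments (`log_lik`, `n_params`, `n_obs`, `resp`) are hypothetical names chosen for illustration, and the paper's exact parameter counting for sBGMM may differ.

```python
import math


def information_criteria(log_lik, n_params, n_obs, resp):
    """Standard model selection scores for a fitted mixture model.

    log_lik  : maximized log-likelihood of the fitted model
    n_params : number of free parameters (assumed supplied by the caller)
    n_obs    : number of observations (e.g. genes)
    resp     : posterior membership probabilities, one row per observation
    Lower scores indicate a preferred model under each criterion.
    """
    # Entropy of the soft assignments; zero entries contribute nothing.
    entropy = -sum(t * math.log(t) for row in resp for t in row if t > 0.0)

    aic = -2.0 * log_lik + 2.0 * n_params
    aic3 = -2.0 * log_lik + 3.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n_obs)
    # ICL-BIC adds an entropy penalty to BIC, favoring well-separated clusters.
    icl_bic = bic + 2.0 * entropy

    return {"AIC": aic, "AIC3": aic3, "BIC": bic, "ICL-BIC": icl_bic}
```

In practice one would fit the model for a range of candidate cluster counts and pick the count that minimizes the chosen criterion; the abstract reports that AIC3 worked best for sBGMM in the authors' simulations.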