• DocumentCode
    1072008
  • Title

    Microarray Gene Cluster Identification and Annotation Through Cluster Ensemble and EM-Based Informative Textual Summarization

  • Author

    Hu, Xiaohua ; Park, E.K. ; Zhang, Xiaodan

  • Author_Institution
    Henan Univ., Kaifeng, China
  • Volume
    13
  • Issue
    5
  • fYear
    2009
  • Firstpage
    832
  • Lastpage
    840
  • Abstract
    Generating high-quality gene clusters and identifying the biological mechanisms underlying them are important goals of cluster analysis of gene expression data. To obtain high-quality clusters, most current approaches rely on choosing the best clustering algorithm, one whose design biases and assumptions match the underlying distribution of the dataset. This approach has two problems: 1) the underlying data distribution of a gene expression dataset is usually unknown, and 2) so many clustering algorithms are available that choosing the proper one is very challenging. To provide a textual summary of gene clusters, the most explored approach is the extractive approach, which essentially builds upon techniques borrowed from information retrieval, where the objective is to provide terms for query expansion rather than a stand-alone summary of the entire document set. Another drawback is that clustering quality and cluster interpretation are treated as two isolated research problems and studied separately. In this paper, we design and develop a unified system, Gene Expression Miner, that addresses these issues in a principled and general manner by integrating cluster ensemble, text clustering, and multidocument summarization, and that provides an environment for comprehensive gene expression data analysis. We present a novel cluster ensemble approach to generate high-quality gene clusters. In our text summarization module, given a gene cluster, our expectation-maximization-based algorithm automatically identifies subtopics and extracts the most probable terms for each subtopic. The top k topical terms extracted from each subtopic are then combined to form the biological explanation of the gene cluster. Experimental results demonstrate that our system can obtain high-quality clusters and provide informative key terms for the gene clusters.
  • Keywords
    bioinformatics; genetics; information retrieval; pattern clustering; statistical analysis; Gene Expression Miner; annotation; extractive approach; informative textual summarization; microarray gene cluster identification; query; Cluster ensemble; expectation–maximization (EM); microarray gene expression analysis; text mining; Algorithms; Cluster Analysis; Databases, Genetic; Genes, Fungal; Models, Statistical; Multigene Family; Oligonucleotide Array Sequence Analysis; Vocabulary, Controlled; Yeasts
  • fLanguage
    English
  • Journal_Title
    Information Technology in Biomedicine, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1089-7771
  • Type

    jour

  • DOI
    10.1109/TITB.2009.2023984
  • Filename
    5072272
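
The abstract mentions a cluster ensemble step but does not specify how the base clusterings are combined. As a rough illustration of the general idea only, the sketch below uses a co-association (evidence accumulation) consensus, which is a common generic technique and is not claimed to be the paper's method. The function names, the 0.5 threshold, and the toy partitions are all assumptions made for this example.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of base partitions that place each pair of items together."""
    n = len(partitions[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus_clusters(partitions, threshold=0.5):
    """Link items co-clustered in more than `threshold` of the partitions,
    then return the connected components as consensus clusters."""
    co = co_association(partitions)
    n = len(co)
    parent = list(range(n))  # union-find forest over item indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Three hypothetical base clusterings of six genes (label ids are arbitrary
# and need not agree across clusterings).
base_partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 2, 2],
    [0, 0, 0, 1, 1, 2],
]
consensus = consensus_clusters(base_partitions)
print(consensus)  # [[0, 1, 2], [3, 4, 5]]
```

Because the consensus depends only on how often pairs of genes are grouped together, the base clusterings can come from different algorithms with incompatible label ids, which is the situation the abstract describes when no single algorithm is known to fit the data.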