• DocumentCode
    3408955
  • Title
    Comparison of two schemes for automatic keyword extraction from MEDLINE for functional gene clustering
  • Author
    Liu, Ying ; Ciliax, Brian J. ; Borges, Karin ; Dasigi, Venu ; Ram, Ashwin ; Navathe, Shamkant B. ; Dingledine, Ray
  • Author_Institution
    Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
  • fYear
    2004
  • fDate
    16-19 Aug. 2004
  • Firstpage
    394
  • Lastpage
    404
  • Abstract
    One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describe the most prominent functions of the genes, and used the resulting weights of the keywords as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted by the TFIDF weighting scheme, but only 23 of 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.
  • Keywords
    biology computing; cellular biophysics; entropy; genetics; information analysis; optimisation; pattern clustering; MEDLINE; automatic keyword extraction; background comparison set; biomedical literature; cell cycle; feature vectors; functional gene clustering; functional keyword association; gene-expression patterns; high mutual information; high purity; low entropy; microarray; normalized z-score; optimized algorithms; stemming algorithm; stop list; term frequency-inverse document frequency; weighting schemes; yeast genes; Abstracts; Biomedical measurements; Clustering algorithms; Data mining; Educational institutions; Frequency; Fungi; Nervous system; Testing; Venus;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE
  • Print_ISBN
    0-7695-2194-0
  • Type
    conf
  • DOI
    10.1109/CSB.2004.1332452
  • Filename
    1332452