DocumentCode :
167287
Title :
Latent Dirichlet Allocation based on Gibbs Sampling for gene function prediction
Author :
Pinoli, Pietro ; Chicco, Davide ; Masseroli, Marco
Author_Institution :
Dipt. di Elettron. Inf. e Bioingegneria, Politec. di Milano, Milan, Italy
fYear :
2014
fDate :
21-24 May 2014
Firstpage :
1
Lastpage :
8
Abstract :
Gene function annotations are key elements in biology and bioinformatics. A typical annotation is the association between a gene and a controlled vocabulary term that describes a functional feature of the gene (e.g. a Gene Ontology (GO) feature term). Unfortunately, available annotations contain errors, and biologically validated ones are incomplete by definition, since new knowledge is continuously discovered. Thus, computational algorithms able to provide ranked lists of predicted new gene annotations are a valuable contribution to bioinformatics research. Here, we propose two variants of the well-known Latent Dirichlet Allocation (LDA) algorithm applied to the prediction of gene annotations. LDA is a very efficient machine learning method built on a set of multinomial probability distributions over a set of topics, given a document (a gene, in our case), and on a set of multinomial probability distributions over a set of words (feature terms, in our case), given a topic. In topic modeling, a topic can be considered as a latent meta-category of words, and a document as a mixture of topics. Our two LDA variants use the collapsed Gibbs Sampling method during the training phase, with two distinct initialization approaches to adapt the LDA mathematical model to the biomolecular annotation scenario. Using six outdated datasets of GO annotations of human and brown rat genes, we compared the annotations predicted by our methods to the ones given by the previously developed truncated Singular Value Decomposition (tSVD) method; then, we validated them by using the annotations available in an updated version of the same datasets. The results obtained show the efficiency of our newly proposed algorithms.
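The model described in the abstract can be illustrated with a minimal sketch of collapsed Gibbs sampling for LDA, treating each gene as a document and each feature term as a word. This is not the authors' implementation: the random initialization, the hyperparameters `alpha` and `beta`, the function names, and the ranking of candidate annotations by the product of the two estimated distributions are all assumptions made for illustration, since the abstract does not specify them.

```python
import random

def lda_collapsed_gibbs(docs, n_terms, n_topics, alpha=0.1, beta=0.01,
                        n_iter=200, seed=0):
    """docs: one list of term indices per gene (hypothetical input format)."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]             # counts n[d][k]
    topic_term = [[0] * n_terms for _ in range(n_topics)]  # counts n[k][w]
    topic_total = [0] * n_topics                           # counts n[k]
    assign = []
    # Assumption: uniform random initialization. The paper's two
    # initialization approaches are not described in the abstract.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_term[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_term[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_term[t][w] + beta)
                    / (topic_total[t] + n_terms * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_term[k][w] += 1
                topic_total[k] += 1
    # Point estimates: topics given gene (theta), terms given topic (phi).
    theta = [[(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for d, doc in enumerate(docs)]
    phi = [[(topic_term[k][w] + beta) / (topic_total[k] + n_terms * beta)
            for w in range(n_terms)] for k in range(n_topics)]
    return theta, phi

def annotation_scores(theta, phi):
    """Assumed scoring rule: score[d][w] = sum over k of theta[d][k] * phi[k][w]."""
    n_terms = len(phi[0])
    return [[sum(t_k * phi[k][w] for k, t_k in enumerate(theta_d))
             for w in range(n_terms)] for theta_d in theta]

# Toy data: four genes annotated with six feature terms (indices 0..5).
genes = [[0, 1, 2], [0, 1], [3, 4], [3, 4, 5]]
theta, phi = lda_collapsed_gibbs(genes, n_terms=6, n_topics=2, n_iter=50)
scores = annotation_scores(theta, phi)
```

Under these assumptions, sorting the terms a gene does not yet have by their score gives a ranked list of candidate annotations for that gene.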
Keywords :
bioinformatics; genomics; learning (artificial intelligence); ontologies (artificial intelligence); sampling methods; LDA algorithm; LDA variants; collapsed Gibbs sampling method; controlled vocabulary term; feature terms; gene annotation prediction; gene function annotations; gene function prediction; gene functional feature; gene ontology feature term; gene-feature term association; latent Dirichlet allocation; latent word metacategory; machine learning method; multinomial probability distributions; tSVD comparison; truncated singular value decomposition; Bioinformatics; Ontologies; Prediction algorithms; Probability distribution; Resource management; Semantics; Vectors;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Intelligence in Bioinformatics and Computational Biology, 2014 IEEE Conference on
Conference_Location :
Honolulu, HI
Type :
conf
DOI :
10.1109/CIBCB.2014.6845514
Filename :
6845514