DocumentCode :
19565
Title :
Improving Retrieval Efficacy of Homology Searches Using the False Discovery Rate
Author :
Carroll, Hyrum D. ; Williams, Alex C. ; Davis, Anthony G. ; Spouge, John L.
Author_Institution :
Dept. of Comput. Sci., Middle Tennessee State Univ., Murfreesboro, TN, USA
Volume :
12
Issue :
3
fYear :
2015
fDate :
May-June 2015
Firstpage :
531
Lastpage :
537
Abstract :
Over the past few decades, discovery based on sequence homology has become a widely accepted practice. Consequently, the comparative accuracy of retrieval algorithms (e.g., BLAST) has been rigorously studied for improvement. Unlike most components of retrieval algorithms, the E-value threshold criterion has yet to be thoroughly investigated. An investigation of the threshold is important because it alone dictates which sequences are declared relevant or irrelevant. In this paper, we introduce the false discovery rate (FDR) statistic as a replacement for the uniform threshold criterion in order to improve efficacy in retrieval systems. Using NCBI's BLAST and PSI-BLAST software packages, we demonstrate the applicability of such a replacement in both non-iterative (BLASTFDR) and iterative (PSI-BLASTFDR) homology searches. For each application, we performed an evaluation of retrieval efficacy with five different multiple testing methods on a large training database. For each algorithm, we chose the best performing method, Benjamini-Hochberg, as the default statistic. As measured by the threshold average precision, BLASTFDR yielded 14.1 percent better retrieval performance than BLAST on a large (5,161 queries) test database and PSI-BLASTFDR attained 11.8 percent better retrieval performance than PSI-BLAST. The C++ source code specific to BLASTFDR and PSI-BLASTFDR and instructions are available at http://www.cs.mtsu.edu/~hcarroll/blast_fdr/.
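For orientation, the Benjamini-Hochberg step-up procedure named in the abstract can be sketched as follows. This is a minimal, generic illustration in Python and not the authors' C++ implementation: the function name `benjamini_hochberg`, the `alpha` parameter, and the assumption that each hit already carries a p-value are all hypothetical. How p-values are derived from BLAST output is outside the scope of this sketch.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per input p-value: True if the hit is declared relevant.

    Generic Benjamini-Hochberg step-up procedure (illustrative sketch only;
    the name and signature are assumptions, not taken from BLASTFDR).
    """
    m = len(p_values)
    # Indices of the hits, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank

    # Every hit ranked at or below the cutoff is declared relevant.
    keep = [False] * m
    for idx in order[:cutoff_rank]:
        keep[idx] = True
    return keep


if __name__ == "__main__":
    # Hypothetical p-values for eight database hits of a single query.
    hits = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(benjamini_hochberg(hits, alpha=0.05))
```

Unlike a uniform threshold, the cutoff here depends on the whole set of hits returned for a query, so the same p-value may be retained for one query and discarded for another.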
Keywords :
C++ language; bioinformatics; information retrieval; iterative methods; molecular configurations; proteins; query formulation; software packages; C++ source code; BLAST software package; Benjamini-Hochberg method; E-value threshold criterion; FDR statistic; PSI-BLAST software package; false discovery rate; homology searches; iterative homology searches; noniterative homology searches; retrieval algorithms; retrieval efficacy; sequence homology; threshold average precision; uniform threshold criterion; Bioinformatics; Computational biology; Databases; Histograms; IEEE transactions; Testing; Training; Homology search; false discovery rate; retrieval efficacy; uniform E-value thresholding;
fLanguage :
English
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher :
IEEE
ISSN :
1545-5963
Type :
jour
DOI :
10.1109/TCBB.2014.2366112
Filename :
6940294