DocumentCode :
1882728
Title :
Big data biology-based predictive models via DNA-metagenomics binning for WMD events applications
Author :
Saghir, Helal ; Megherbi, Dalila B.
Author_Institution :
Dept. of Electr. & Comput., Univ. of Massachusetts, Lowell, MA, USA
fYear :
2015
fDate :
14-16 April 2015
Firstpage :
1
Lastpage :
6
Abstract :
In weapons of mass destruction (WMD) events or natural disasters, rapidly identifying biochemicals and microorganisms is crucial. Metagenomics is the study of microorganisms collected directly from natural environments using whole genome shotgun (WGS) sequencing. Metagenomic methods allow sequencing of the genomes of organisms that cannot be cultured in a laboratory. Sorting the random fragments obtained from WGS data into groups is known as binning. Metagenomic methods allow quick sequencing of microbes collected from disaster sites, so that the microbes can be identified and a timely response provided, for example rapid environmental cleanup and restoration, rapid quarantine of objects, animals, and humans, and recovery. In this paper we propose machine-learning-based predictive DNA sequence feature selection algorithms that solve binning problems more accurately and efficiently. We use sub-sequence blocks extracted from organism protein domains as features. We analyze and compare the binning prediction results obtained using k-mers, using codons, and using sub-sequence blocks derived from conserved protein domains. We show that sub-sequence blocks derived from conserved protein domains give better prediction accuracy than k-mers or codons. We also present a comparative analysis of binning predictive models built with the Naïve Bayes classifier and the Random Forest classifier, using the feature set derived from conserved protein domains. Our analysis shows that the Random Forest classifier achieves better classification accuracy than the Naïve Bayes classifier.
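The two baseline feature types named in the abstract, k-mers and codons, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmer_counts` and `codon_counts`, the fixed `ACGT` alphabet, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration. The conserved-protein-domain features the paper proposes are not reproduced here, since the abstract does not describe how they are built.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous DNA bases are counted


def kmer_counts(seq, k):
    """Count overlapping k-mers and return a fixed-length vector of 4**k counts.

    Windows containing characters outside ALPHABET (e.g. 'N') are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get("".join(p), 0) for p in product(ALPHABET, repeat=k)]


def codon_counts(seq, frame=0):
    """Count non-overlapping triplets in one reading frame (64-element vector)."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    counts = Counter(codons)
    return [counts.get("".join(p), 0) for p in product(ALPHABET, repeat=3)]


if __name__ == "__main__":
    fragment = "ATGATGTAA"  # hypothetical read, for illustration only
    print(kmer_counts(fragment, 2))
    print(codon_counts(fragment))
```

Vectors like these, one per sequence fragment, are the kind of input that a Naïve Bayes or Random Forest classifier could be trained on for binning. Any such pipeline built from this sketch would be an assumption, not a description of the paper's experiments.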
Keywords :
Bayes methods; Big Data; DNA; biology computing; disasters; genomics; learning (artificial intelligence); pattern classification; proteins; DNA-metagenomics binning; WGS; WMD events applications; big data biology-based predictive models; biochemicals; codons; conserved protein domain; conserved protein domains; k-mers; machine learning related predictive DNA sequence feature selection algorithms; naïve Bayes classifier; natural disaster sites; natural disasters; natural environments; organism genomes; organism protein domains; random forest classifier; random fragments; rapid environment cleanup-restoration; rapid quarantine; shotgun genome data; whole genome shotgun sequencing; Accuracy; Bioinformatics; DNA; Feature extraction; Genomics; Organisms; Proteins; Machine learning; bagged decision tree; binning; bioinformatics; codon; conserved protein domain; forward sequential feature selection; k-mers; metagenomics; next generation sequencing; random forest; reduction methods;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Technologies for Homeland Security (HST), 2015 IEEE International Symposium on
Conference_Location :
Waltham, MA
Print_ISBN :
978-1-4799-1736-5
Type :
conf
DOI :
10.1109/THS.2015.7225313
Filename :
7225313