Author/Authors :
Huang, Po-Jung Department of Biomedical Sciences - Chang Gung University - Taoyuan, Taiwan , Chang, Jui-Huan Chang Gung University - Taoyuan, Taiwan , Lin, Hou-Hsien National Tsing Hua University - Hsinchu, Taiwan , Li, Yu-Xuan Chang Gung University - Taoyuan, Taiwan , Lee, Chi-Ching Department of Computer Science and Information Engineering - Chang Gung University - Taoyuan, Taiwan , Su, Chung-Tsai Department of Biomedical Sciences - Chang Gung University - Taoyuan, Taiwan , Li, Yun-Lung Department of Biomedical Sciences - Chang Gung University - Taoyuan, Taiwan , Chang, Ming-Tai Department of Biomedical Sciences - Chang Gung University - Taoyuan, Taiwan , Weng, Sid Department of Biomedical Sciences - Chang Gung University - Taoyuan, Taiwan , Cheng, Wei-Hung Department of Parasitology - College of Medicine - Chang Gung University - Taoyuan, Taiwan , Chiu, Cheng-Hsun Chang Gung Memorial Hospital - Linkou, Taiwan , Tang, Petrus Department of Parasitology - College of Medicine - Chang Gung University - Taoyuan, Taiwan
Abstract :
Although sequencing a human genome has become affordable, identifying genetic variants from whole-genome sequence data is
still a hurdle for researchers without adequate computing equipment or bioinformatics support. GATK was the gold-standard
method for identifying genetic variants and was widely used in genome projects and population genetic studies for
many years, until the Google Brain team developed DeepVariant, a new method that uses deep neural networks
to build an image classification model for identifying genetic variants. However, the superior accuracy of DeepVariant comes at
the cost of computational intensity, largely constraining its applications. Accordingly, we present DeepVariant-on-Spark to
optimize resource allocation, enable multi-GPU support, and accelerate the processing of the DeepVariant pipeline. To make
DeepVariant-on-Spark more accessible, we have deployed it on the Google Cloud Platform
(GCP). By following our instructions, users can deploy DeepVariant-on-Spark on the GCP within 20 minutes and start to analyze at
least ten whole-genome sequencing datasets using free credits provided by the GCP. DeepVariant-on-Spark is freely available
for small-scale genome analysis using a cloud-based computing framework, which is suitable for pilot testing or preliminary
studies, while retaining the flexibility and scalability needed for large-scale sequencing projects.
Keywords :
DeepVariant-on-Spark , Analysis , Cloud-Based , GATK