Title :
Parallel analysis of large graph-structured data in genomics and proteomics
Author_Institution :
Dept. of Comput. Sci. & Eng., Pennsylvania State Univ., University Park, PA, USA
Abstract :
This talk presents two new software tools that our research group is developing: lark, a de novo genome assembler, and FASCIA, a tool to approximately count and enumerate network motifs. Our de novo genome assembly approach involves construction, traversal, and simplification of de Bruijn graphs on large multicore clusters. We present new parallelization strategies and optimizations for various steps of the assembly process, focusing on techniques for scalable de Bruijn graph construction from short-read data. We demonstrate the parallel efficiency and scalability of our assembler with performance results and analysis of several recent metagenomic data sets. The problem of subgraph counting refers to determining the frequency of occurrence of a given subgraph or template within a large network. Subgraph counting is used to detect and characterize local structure (network motifs) in several biological networks, including protein interaction networks and metabolic networks. Exhaustive enumeration and exact counting is extremely compute-intensive, with running time growing exponentially with the number of vertices in the template. In FASCIA, we combine parallelism with the color coding technique to determine approximate counts of non-induced occurrences of the subgraph in the original network. The color coding technique gives a fast approximate algorithm for this problem, using a dynamic programming-based counting approach. Our new contributions are a multilevel shared-memory parallelization of the counting scheme and several optimizations to reduce the memory footprint. We show that approximate counts can be obtained for templates with up to 12 vertices, on networks with up to millions of vertices and edges.
Keywords :
biochemistry; biology computing; genomics; optimisation; proteins; proteomics; FASCIA; assembly process; biological networks; color coding technique; de Bruijn graph construction; de novo genome assembly approach; dynamic programming-based counting approach; fast approximate algorithm; genomics; large graph-structured data; large multicore clusters; local structure; memory footprint; metabolic networks; metagenomic data sets; multilevel shared-memory parallelization; optimizations; parallel analysis; parallel efficiency; parallel scalability; parallelization strategies; protein interaction networks; proteomics; short-read data; software tools; subgraph counting problem; Assembly; Bioinformatics; Educational institutions; Fascia; Genomics; Image color analysis; Optimization;
Conference_Titel :
Computational Advances in Bio and Medical Sciences (ICCABS), 2013 IEEE 3rd International Conference on
Conference_Location :
New Orleans, LA
DOI :
10.1109/ICCABS.2013.6629240