مرکز منطقه ای اطلاع رساني علوم و فناوري - Parallel analysis of large graph-structured data in genomics and proteomics

DocumentCode :

3496482

Title :

Parallel analysis of large graph-structured data in genomics and proteomics

Author :

Madduri, Kamesh

Author_Institution :

Dept. of Comput. Sci. & Eng., Pennsylvania State Univ., University Park, PA, USA

fYear :

2013

fDate :

12-14 June 2013

Firstpage :

Lastpage :

Abstract :

This talk presents two new software tools that our research group is developing: lark, a de novo genome assembler, and FASCIA, a tool to approximately count and enumerate network motifs. Our de novo genome assembly approach involves construction, traversal, and simplification of de Bruijn graphs on large multicore clusters. We present new parallelization strategies and optimizations for various steps of the assembly process, focusing on techniques for scalable de Bruijn graph construction from short-read data. We demonstrate the parallel efficiency and scalability of our assembler with performance results and analysis of several recent metagenomic data sets. The problem of subgraph counting refers to determining the frequency of occurrence of a given subgraph or template within a large network. Subgraph counting is used to detect and characterize local structure (network motifs) in several biological networks, including protein interaction networks and metabolic networks. Exhaustive enumeration and exact counting is extremely compute-intensive, with running time growing exponentially with the number of vertices in the template. In FASCIA, we combine parallelism with the color coding technique to determine approximate counts of non-induced occurrences of the subgraph in the original network. The color coding technique gives a fast approximate algorithm for this problem, using a dynamic programming-based counting approach. Our new contributions are a multilevel shared-memory parallelization of the counting scheme and several optimizations to reduce the memory footprint. We show that approximate counts can be obtained for templates with up to 12 vertices, on networks with up to millions of vertices and edges.

Keywords :

biochemistry; biology computing; genomics; optimisation; proteins; proteomics; FASCIA; assembly process; biological networks; color coding technique; de Bruijn graph construction; de novo genome assembly approach; dynamic programming-based counting approach; fast approximate algorithm; genomics; large graph-structured data; large multicore clusters; local structure; memory footprint; metabolic networks; metagenomic data sets; multilevel shared-memory parallelization; optimizations; parallel analysis; parallel efficiency; parallel scalability; parallelization strategies; protein interaction networks; proteomics; short-read data; software tools; subgraph counting problem; Assembly; Bioinformatics; Educational institutions; Fascia; Genomics; Image color analysis; Optimization;

fLanguage :

English

Publisher :

ieee

Conference_Titel :

Computational Advances in Bio and Medical Sciences (ICCABS), 2013 IEEE 3rd International Conference on

Conference_Location :

New Orleans, LA

Type :

conf

DOI :

10.1109/ICCABS.2013.6629240

Filename :

6629240

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3496482