• DocumentCode
    1995467
  • Title

    Secondary Structure Predictions for Long RNA Sequences Based on Inversion Excursions and MapReduce

  • Author

    Yehdego, Daniel T. ; Boyu Zhang ; Kodimala, Vikram K. R. ; Johnson, Kyle L. ; Taufer, Michela ; Ming-Ying Leung

  • Author_Institution
    Univ. of Texas at El Paso, El Paso, TX, USA
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    520
  • Lastpage
    529
  • Abstract
    Secondary structures of ribonucleic acid (RNA) molecules play important roles in many biological processes, including gene expression and regulation. Experimental observations and computing limitations suggest that we can approach the secondary structure prediction problem for long RNA sequences by segmenting them into shorter chunks, predicting the secondary structure of each chunk individually using existing prediction programs, and then assembling the results to give the structure of the original sequence. The selection of cutting points is a crucial component of the segmenting step. Noting that stem-loops and pseudoknots always contain an inversion, i.e., a stretch of nucleotides followed closely by its inverse complementary sequence, we developed two cutting methods for segmenting long RNA sequences based on inversion excursions: the centered method and the optimized method. Each of the steps (searching for inversions, chunking, and prediction) can be performed in parallel. In this paper we use a MapReduce framework, i.e., Hadoop, to extensively explore meaningful inversion stem lengths and gap sizes for the segmentation and to identify correlations between chunking methods and prediction accuracy. We show that for a set of long RNA sequences in the RFAM database whose secondary structures are known to contain pseudoknots, our approach predicts secondary structures more accurately than methods that do not segment the sequence, when the latter predictions are computationally possible. We also show that, as sequences exceed certain lengths, some programs cannot computationally predict pseudoknots while our chunking methods can. Overall, our predicted structures still retain the accuracy level of the original prediction programs when compared with known experimental secondary structures.
  • Keywords
    RNA; biology computing; parallel programming; Hadoop; MapReduce framework; RFAM database; RNA molecules; biological processes; centered optimized method; chunking methods; chunking search; cutting point selection; gene expression; gene regulation; inverse complementary sequence; inversion excursions; inversion search; long RNA sequence segmentation; nucleotides; prediction program accuracy level; prediction search; pseudoknots; ribonucleic acid molecules; secondary structure prediction problem; stem-loops; Accuracy; Biological processes; Databases; Gene expression; Genomics; Prediction algorithms; RNA; Hadoop; Performance analysis; Prediction accuracy; Pseudoknots; RNA segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
  • Conference_Location
    Cambridge, MA
  • Print_ISBN
    978-0-7695-4979-8
  • Type

    conf

  • DOI
    10.1109/IPDPSW.2013.109
  • Filename
    6650927
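
The abstract defines an inversion as a stretch of nucleotides followed closely by its inverse complementary sequence, parameterized by a stem length and a gap size. The sketch below illustrates only that plain definition; it is not the authors' inversion-excursion method, their centered or optimized cutting method, or their Hadoop pipeline. The function names, the exact-match criterion, and the restriction to Watson-Crick pairs (A-U, G-C, no G-U wobble) are assumptions made for illustration.

```python
# Assumption: only Watson-Crick pairs are considered complementary.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(segment):
    """Return the inverse complementary sequence of an RNA segment."""
    return "".join(COMPLEMENT[base] for base in reversed(segment))


def find_inversions(seq, stem_len, max_gap):
    """Return (start, gap) pairs where seq[start:start + stem_len] is
    followed, after `gap` intervening bases (0 <= gap <= max_gap), by its
    exact inverse complement. Hypothetical helper, not from the paper."""
    hits = []
    for start in range(len(seq) - 2 * stem_len + 1):
        target = reverse_complement(seq[start:start + stem_len])
        for gap in range(max_gap + 1):
            right_start = start + stem_len + gap
            right = seq[right_start:right_start + stem_len]
            if len(right) < stem_len:
                break  # ran off the end of the sequence
            if right == target:
                hits.append((start, gap))
    return hits


if __name__ == "__main__":
    # "GGG" pairs with "CCC" across a 4-base gap, as in a simple stem-loop.
    print(find_inversions("GGGAAAACCC", stem_len=3, max_gap=4))
```

Each starting position is examined independently of the others, which is consistent with the abstract's remark that the inversion search can be performed in parallel.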