• DocumentCode
    653958
  • Title
    Consensus Sigma-70 Promoter Prediction Using Hadoop
  • Author
    Hogan, James M. ; Kelly, Wayne A. ; Newell, Felicity S.
  • fYear
    2013
  • fDate
    22-25 Oct. 2013
  • Firstpage
    35
  • Lastpage
    44
  • Abstract
    MapReduce frameworks such as Hadoop are well suited to handling large data sets whose elements can be processed separately and independently, with canonical applications in information retrieval and sales record analysis. Rapid advances in sequencing technology have produced an explosion in the availability of genomic data, with a consequent rise in the importance of large-scale comparative genomics, which often involves operations and data relationships that deviate from the classical MapReduce structure. This work examines the application of Hadoop to patterns of this nature, focusing on a well-established workflow for identifying promoters (binding sites for regulatory proteins) across multiple gene regions and organisms, coupled with the unifying step of assembling these results into a consensus sequence. Our approach demonstrates the utility of Hadoop for such problems, showing how the tyranny of the "dominant decomposition" can be at least partially overcome. It also demonstrates how load balance and the granularity of parallelism can be optimized by pre-processing that splits and reorganizes input files, allowing a wide range of related problems to be brought under the same computational umbrella.
  • Keywords
    biology computing; data handling; genomics; parallel programming; proteins; public domain software; Hadoop; MapReduce frameworks; binding sites; consensus Sigma-70 promoter prediction; data relationships; dominant decomposition; genomic data; information retrieval; large data set handling; large scale comparative genomics; multiple gene regions; multiple organisms; parallelism granularity; promoter identification; regulatory proteins; sales record analysis; sequencing technology; Bioinformatics; Context; DNA; Genomics; Java; Organisms; Proteins; Bioinformatics; Hadoop; Map Reduce; Promoter Prediction;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    eScience (eScience), 2013 IEEE 9th International Conference on
  • Conference_Location
    Beijing
  • Type
    conf
  • DOI
    10.1109/eScience.2013.42
  • Filename
    6683889