  • DocumentCode
    4349
  • Title
    Information Theory of DNA Shotgun Sequencing
  • Author
    Motahari, Abolfazl S.; Bresler, Guy; Tse, David N. C.
  • Author_Institution
    Dept. of Electr. Eng. & Comput. Sci., Univ. of California, Berkeley, Berkeley, CA, USA
  • Volume
    59
  • Issue
    10
  • fYear
    2013
  • fDate
    Oct. 2013
  • Firstpage
    6273
  • Lastpage
    6289
  • Abstract
    DNA sequencing is the basic workhorse of modern-day biology and medicine. Shotgun sequencing is the dominant technique used: many randomly located short fragments called reads are extracted from the DNA sequence, and these reads are assembled to reconstruct the original sequence. A basic question is: given a sequencing technology and the statistics of the DNA sequence, what is the minimum number of reads required for reliable reconstruction? This number provides a fundamental limit to the performance of any assembly algorithm. For a simple statistical model of the DNA sequence and the read process, we show that the answer admits a critical phenomenon in the asymptotic limit of long DNA sequences: if the read length is below a threshold, reconstruction is impossible no matter how many reads are observed, and if the read length is above the threshold, having enough reads to cover the DNA sequence is sufficient to reconstruct. The threshold is computed in terms of the Rényi entropy rate of the DNA sequence. We also study the impact of noise in the read process on the performance.
  • Keywords
    DNA; biological techniques; entropy; information theory; molecular biophysics; statistics; DNA shotgun sequencing; Rényi entropy rate; noise impact; Algorithm design and analysis; Assembly; Bioinformatics; Genomics; Greedy algorithms; Sequential analysis; DNA sequencing; de novo assembly
  • fLanguage
    English
  • Journal_Title
    Information Theory, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9448
  • Type
    jour
  • DOI
    10.1109/TIT.2013.2270273
  • Filename
    6544699