Title :
An optimal DNA segmentation based on the MDL principle
Author :
Szpankowski, Wojciech ; Ren, Wenhui ; Szpankowski, Lukasz
Author_Institution :
Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
Abstract :
The biological world is highly stochastic as well as inhomogeneous in its behavior. The transitions between homogeneous and inhomogeneous regions of DNA, also known as change points, carry important biological information. Our goal is to employ rigorous methods of information theory to quantify structural properties of DNA sequences. In particular, we adopt the Stein-Ziv lemma to find an asymptotically optimal discriminant function that determines whether two DNA segments are generated by the same source while keeping false positives exponentially small. Then we apply the minimum description length (MDL) principle to select the parameters of our segmentation algorithm. Finally, we perform extensive experimental work on human chromosome 9. After grouping A and G (purines) and T and C (pyrimidines), we discover change points between coding and noncoding regions as well as the beginning of a CpG island.
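The purine/pyrimidine grouping mentioned in the abstract, and the idea of scoring whether two segments look like they come from the same source, can be illustrated with a minimal sketch. Only the A/G and T/C grouping is taken from the abstract. The function names, the example sequences, and the use of Jensen-Shannon divergence between empirical symbol frequencies are assumptions made for illustration; this is not the discriminant function or the MDL-based parameter selection developed in the paper.

```python
from collections import Counter
from math import log2

# Grouping stated in the abstract: A, G -> purine (R); T, C -> pyrimidine (Y).
GROUP = {"A": "R", "G": "R", "T": "Y", "C": "Y"}


def to_binary(seq):
    """Reduce a DNA string to the two-letter R/Y alphabet, skipping other symbols (e.g. N)."""
    return "".join(GROUP[b] for b in seq.upper() if b in GROUP)


def empirical(seq):
    """Empirical distribution over R and Y for a reduced, non-empty sequence."""
    if not seq:
        raise ValueError("segment must contain at least one A, C, G or T")
    counts = Counter(seq)
    return {s: counts[s] / len(seq) for s in "RY"}


def js_divergence(x, y):
    """Jensen-Shannon divergence (in bits) between the R/Y frequencies of x and y.

    Hypothetical stand-in score: 0 for identical frequencies, 1 for disjoint ones.
    """
    p, q = empirical(x), empirical(y)
    m = {s: 0.5 * (p[s] + q[s]) for s in "RY"}

    def kl(a, b):
        # Terms with a[s] == 0 contribute nothing; b[s] > 0 whenever a[s] > 0.
        return sum(a[s] * log2(a[s] / b[s]) for s in "RY" if a[s] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up example segments: one purine-only, one pyrimidine-only.
left = to_binary("AAGGAGAAGG")
right = to_binary("TTCCTCTTCC")
score = js_divergence(left, right)
```

A score near zero suggests the two windows have similar purine/pyrimidine composition, while a large score hints at a possible change point between them. A zeroth-order frequency comparison like this ignores dependencies between neighboring bases, so it should be read only as a toy version of the two-segment comparison the abstract describes.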
Keywords :
DNA; biology computing; cellular biophysics; data compression; genetics; information theory; molecular biophysics; DNA sequences; Stein-Ziv lemma; homogeneous regions; human chromosome; information theory; inhomogeneous regions; minimum description length principle; optimal DNA segmentation; optimal discriminant function; purines; pyrimidines; segmentation algorithm; Biological information theory; Biology; Cells (biology); DNA; Data compression; Genetic communication; Hidden Markov models; Image coding; Information theory; Sequences;
Conference_Title :
Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE
Print_ISBN :
0-7695-2000-6
DOI :
10.1109/CSB.2003.1227402