Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Manitoba, Winnipeg, MB, Canada
Abstract :
Summary form only given. Deoxyribonucleic acid (DNA) has become one of the most examined molecules on the planet. Scientists around the world have been trying to unravel its secrets for many purposes. For example, genetic information is currently used to raise better plants and animals, to create enhanced pharmaceuticals for humans, and for gene therapy in medicine. Science as a whole has benefited from the study of genetics because of the increased understanding of the biological processes that all organisms share. In recent decades, a significant amount of research has been directed towards sequencing and understanding the entire human genome through the Human Genome Project (HGP), launched in 1986. The goal of the HGP was to find the locations of the approximately 1×10^5 human genes and to read the entire sequence of the human genome (about 3×10^9 base pairs, bp). The exponential growth rate of that research resulted in the goal being reached by 2003. Similarly, the speed of finding genes and their locations is also increasing rapidly. On the other hand, the traditional methods of finding genes and their locations on chromosomes by testing their biological function have been inherently slow. Although numerous faster techniques have been developed, there is still a need to augment them with new approaches. Therefore, robust computational solutions to the gene-finding problem could provide a valuable resource for the HGP and for the molecular-biology community. Most of the current research on deciphering the meaning of DNA sequences approaches the problem from the lowest, base-pair level. Its main objective is to search for patterns or correlations in the DNA sequence related to codons, amino acids, and proteins. A number of gene-finding systems have been developed in recent decades.
These systems use a variety of sophisticated computational data-mining techniques, including neural networks, dynamic programming, rule-based methods, decision trees, probability reasoning, hidden Markov chains, genetic programming, and support vector machines. Most of these approaches are based on local measures only. In addition, many of the techniques rely on the statistical properties of exons in the gene, and thus use only the known gene pool as a training set for their classification. Because these techniques have demonstrated only limited success, better ones should be developed. One approach to finding such improved techniques is to consider long-range relations (in addition to short-range relations) in the DNA sequence, spanning 10^4 nucleotides. If we had a good technique to measure such long-range relations, we would be able to estimate any existing self-affinity (fractality) in the DNA sequence without any a priori assumptions about its structure. This would be a data-driven approach, rather than the common model-driven approach. Along those lines, preliminary results have already been reported in the literature on local self-similarity with a 180 bp periodicity in mammalian nuclear DNA sequences. Other publications have provided evidence that long-range fractal correlations appear in DNA sequences, with different values in different regions of the sequence. This paper describes such a multiscale approach, together with an algorithm based on multifractal analysis, and demonstrates that multifractal estimates can be used to characterize DNA sequences [1], [2], [3]. This multifractal approach appears to be new, and may provide a key to cognitive analysis of DNA sequences. It should be clear that DNA sequencing and gene-finding techniques constitute a subset of bioinformatics, the science of using information to understand biology, with its numerous tools.
In turn, bioinformatics is a subset of computational biology, which is the application of quantitative analytical techniques to modelling biological systems. Very often, for structural biologists, DNA is not just a sequence of symbols, but implies 3D structures, molecular shapes an
Keywords :
DNA; bioinformatics; data mining; gene therapy; genetics; genomics; molecular biophysics; proteins; bioinformatics; biological process; cognitive analysis; computational biology; computational data mining technique; decision tree; deoxyribonucleic acid; dynamic programming; genetic information; genetic programming; hidden Markov chain; human genome project; mammalian nuclear DNA sequence; molecular biology community; neural network; probability reasoning; rule based method; support vector machines; Bioinformatics; Cognitive informatics; Computers; DNA; Fractals; Genomics; Humans; DNA sequences; feature extraction for classification; genomes; monoscale and multiscale measures; multifractal analysis;
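
Illustrative sketch :
The abstract does not give the algorithm itself, so the following is only a minimal sketch of one common way to obtain multifractal estimates from a symbolic sequence, not the authors' method. It assumes that the positions of a single nucleotide are treated as a measure along the sequence, that the sequence is partitioned into non-overlapping boxes at several scales, and that the Renyi generalized dimensions D_q are estimated as regression slopes over those scales. The function name `renyi_dimensions`, the choice of symbol, the moment orders `qs`, and the box sizes are all hypothetical choices made for illustration.

```python
import numpy as np


def renyi_dimensions(sequence, symbol="G", qs=(0, 1, 2),
                     box_sizes=(4, 8, 16, 32, 64)):
    """Estimate Renyi generalized dimensions D_q for one symbol.

    Assumption (not from the source): the occurrences of `symbol`
    define a measure on the sequence, analysed by box counting.
    """
    indicator = np.array([1.0 if base == symbol else 0.0
                          for base in sequence])
    n = len(indicator)
    if indicator.sum() == 0:
        raise ValueError("symbol does not occur in the sequence")

    log_eps = []                      # log of relative box size
    terms = {q: [] for q in qs}       # scale-dependent term for each q

    for size in box_sizes:
        n_boxes = n // size
        if n_boxes < 2:
            continue                  # scale too coarse for this sequence
        # Mass of the symbol in each non-overlapping box.
        masses = indicator[:n_boxes * size].reshape(n_boxes, size).sum(axis=1)
        if masses.sum() == 0:
            continue
        p = masses[masses > 0] / masses.sum()   # box probabilities
        log_eps.append(np.log(size / n))
        for q in qs:
            if q == 1:
                # Information dimension: limit of the general formula.
                terms[q].append(np.sum(p * np.log(p)))
            else:
                terms[q].append(np.log(np.sum(p ** q)) / (q - 1))

    if len(log_eps) < 2:
        raise ValueError("need at least two usable box sizes")

    # D_q is the slope of the q-dependent term against log(box size).
    return {q: float(np.polyfit(log_eps, terms[q], 1)[0]) for q in qs}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    random_dna = "".join(rng.choice(list("ACGT"), size=4096))
    for q, d in renyi_dimensions(random_dna).items():
        print(f"D_{q} = {d:.3f}")
```

In this sketch a sequence in which the chosen symbol is spread evenly yields D_q close to 1 for every q, while a spread of D_q values across q indicates multifractal structure. Applying the function to sliding windows of a long sequence would give region-dependent estimates of the kind the abstract refers to, although the window length and step would again be assumptions.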