Author Institution:
Dept. of Phys., South China Univ. of Technol., Guangzhou, China
Abstract:
Comparing the similarity of two biological sequences is one of the central problems in computational biology. The D2 statistic, a powerful statistical measure based on the joint k-tuple content of two sequences, has been applied to alignment-free sequence comparison. We generate two mutually independent random sequences under a null model with an AT-rich nucleotide distribution (PA = PT = 0.33, PC = PG = 0.17) and, starting from this null model, obtain two foreground sequences with Bernoulli variables through a pattern transfer model. For the foreground sequences, we compare local sequence pairs of a given length and sum over all such pairs, and we test the resulting local alignment-free comparison with the statistics D2, D2star, and D2shepp; the power of the three statistics then indicates the optimal parameters. The simulation results show that D2star performs better than D2shepp and that D2 is relatively weak. We also analyze how the power is distributed under different parameters, including the Bernoulli variable g, the tuple size k, and the type I error. Finally, comparing the proposed local approach with the global alignment-free approach for D2star and D2shepp under the same parameters shows that the power of the local alignment-free comparison based on D2star tends to 1 quickly as the sequence length increases, faster and more accurately than the global approach.
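The abstract names the three statistics without giving their formulas. As a minimal sketch, assuming the commonly used definitions from the alignment-free literature (D2 as the sum of products of k-tuple counts, D2star and D2shepp as centered and normalized variants) and an i.i.d. background with the AT-rich probabilities quoted above, the computation for a single pair of sequences might look as follows. The function names `random_sequence`, `kmer_counts`, and `d2_statistics` are hypothetical and not taken from the paper; the pattern transfer model and the local summation scheme are not reproduced here because the abstract does not specify them.

```python
import random
from collections import Counter
from itertools import product
from math import prod, sqrt

# AT-rich background distribution quoted in the abstract.
AT_RICH = {"A": 0.33, "T": 0.33, "C": 0.17, "G": 0.17}


def random_sequence(n, probs=AT_RICH, seed=None):
    """Draw an i.i.d. sequence of length n from the background (null model)."""
    rng = random.Random(seed)
    letters = list(probs)
    weights = [probs[c] for c in letters]
    return "".join(rng.choices(letters, weights=weights, k=n))


def kmer_counts(seq, k):
    """Count overlapping k-tuples in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(x, y, k, probs=AT_RICH):
    """Return (D2, D2star, D2shepp) for sequences x and y.

    Assumed definitions (not stated in the abstract):
      D2      = sum_w X_w * Y_w
      D2star  = sum_w Xc_w * Yc_w / (sqrt(n_bar * m_bar) * p_w)
      D2shepp = sum_w Xc_w * Yc_w / sqrt(Xc_w**2 + Yc_w**2)
    where Xc_w = X_w - n_bar * p_w and n_bar = len(x) - k + 1.
    """
    n_bar = len(x) - k + 1
    m_bar = len(y) - k + 1
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    d2 = d2_star = d2_shepp = 0.0
    for letters in product(probs, repeat=k):
        w = "".join(letters)
        p_w = prod(probs[c] for c in w)  # word probability under i.i.d. model
        xw, yw = cx[w], cy[w]            # Counter returns 0 for unseen words
        xc = xw - n_bar * p_w            # centered counts
        yc = yw - m_bar * p_w
        d2 += xw * yw
        d2_star += xc * yc / (sqrt(n_bar * m_bar) * p_w)
        denom = sqrt(xc * xc + yc * yc)
        if denom > 0:                    # skip words where both centered counts vanish
            d2_shepp += xc * yc / denom
    return d2, d2_star, d2_shepp
```

A power study of the kind described would call `d2_statistics` repeatedly on simulated null and foreground pairs and compare the resulting score distributions at a fixed type I error; that loop is omitted here because it depends on the unspecified foreground model.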