• DocumentCode
    3477417
  • Title
    A Method for Evaluating Quality of Clustering DNA Fragments Encoded in Different Nucleotide Frequencies
  • Author
    Chan, Chon-Kit Kenneth ; Hsu, Arthur L. ; Tang, Sen-Lin ; Halgamuge, Saman K.
  • Author_Institution
    DoMME, Univ. of Melbourne, Melbourne, VIC
  • fYear
    2007
  • fDate
    11-13 Oct. 2007
  • Firstpage
    60
  • Lastpage
    63
  • Abstract
    The whole-genome shotgun sequencing technique has been successfully applied to environmental genomes. However, a considerable number of DNA sequences and small contigs generally remain unassembled after shotgun sequencing. Binning is the step of grouping these sequences based on biological and molecular features. The combination of oligonucleotide frequency and the Self-Organising Map (SOM) clustering algorithm shows high potential as a compositional binning tool. As previous work did not provide methods for assessing results, we propose a systematic quantitative method to evaluate the clustering results specifically for this type of application. We used this method to investigate the suitability of each of di-, tri-, tetra- and pentanucleotide frequencies as the training feature for this binning technique. The results show that dinucleotide frequency is unable to bin Wkb DNA sequence fragments into well-clustered species groups. Furthermore, we noticed that increasing the order of the oligonucleotide frequency may degrade the assignment of DNA sequences to classes in our test, which indicates the possible existence of an optimal species-specific oligonucleotide frequency. The results suggest that using trinucleotide frequency with SOM as a binning process gives sufficiently good clustering quality in this case.
  • Keywords
    DNA; biology computing; genetics; molecular biophysics; molecular configurations; DNA fragments; clustering algorithm; compositional binning tool; nucleotide frequencies; self-organising maps; whole-genome shotgun sequencing; Assembly; Biodiversity; Bioinformatics; Control systems; DNA; Frequency; Genomics; Information technology; Sequences; Testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Frontiers in the Convergence of Bioscience and Information Technologies, 2007. FBIT 2007
  • Conference_Location
    Jeju City
  • Print_ISBN
    978-0-7695-2999-8
  • Type
    conf
  • DOI
    10.1109/FBIT.2007.70
  • Filename
    4524080