• DocumentCode
    3533108
  • Title

    A text-mining approach for classification of genomic fragments

  • Author

    Gadia, Vinay ; Rosen, Gail

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Drexel Univ., Philadelphia, PA
  • fYear
    2008
  • fDate
    3-5 Nov. 2008
  • Firstpage
    107
  • Lastpage
    108
  • Abstract
    Genome identification is an emerging area of interest due to the study of environmental DNA samples. We show that performance approaches 50% for classifying 500 bp fragments when using 12-mer features and, more importantly, that performance increases linearly for large N. Second, we determine that an inverted TF-IDF measure performs 16% better when using only 80% of the words than when using the full set (100%). This increase implies that, while too sparse a feature subset does not produce good results, a carefully selected set has the potential to improve genome classification over a random feature set. Computing even 80% of all possible features can result in significant savings in computation. The Euclidean classifier and TF-IDF measures will pave the way for more discriminative classification techniques.
  • Keywords
    biocomputing; biology computing; data mining; pattern classification; text analysis; Euclidean classifier; discriminative classification techniques; environmental DNA samples; genome identification; genomic fragments classification; inverted TF-IDF measure; text-mining approach; Bioinformatics; DNA; Data analysis; Euclidean distance; Frequency; Genomics; Performance evaluation; Phylogeny; Spatial databases; Testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics and Biomedicine Workshops, 2008. BIBMW 2008. IEEE International Conference on
  • Conference_Location
    Philadelphia, PA
  • Print_ISBN
    978-1-4244-2890-8
  • Type

    conf

  • DOI
    10.1109/BIBMW.2008.4686216
  • Filename
    4686216