DocumentCode :
3495568
Title :
An unsupervised learning approach to assembly validation
Author :
Lanc, Irena ; Emrich, Scott
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Notre Dame, South Bend, IN, USA
fYear :
2013
fDate :
12-14 June 2013
Firstpage :
1
Lastpage :
6
Abstract :
The dramatic decrease in sequencing cost has led to an unprecedented number of new projects. Because finishing a genome often involves painstaking work, most sequencing efforts are satisfied by an initial draft assembly. Reliable analysis of synteny, gene expansion, and structural variation, however, is contingent upon the available draft being an accurate representation of the genome. If the genome has not been thoroughly vetted, this trust may be misplaced and misassemblies can undermine any conclusions drawn about the structure and content of a genome. Here, we developed and tested the use of unsupervised learning algorithms to uncover outliers that point to improperly assembled segments. We first modified the existing amosvalidate software to extract detailed assembly features from the raw assembly file. We then used this information to construct feature vectors for each section of the assembly, and clustered these using various unsupervised learning techniques. Our results show that misassembled regions clearly differentiate themselves from correctly assembled regions, particularly in certain dimensions such as read coverage and SNP placement.
Keywords :
biology computing; feature extraction; genetics; genomics; self-assembly; unsupervised learning; SNP placement; gene expansion analysis; genome assembly feature extraction; genome assembly feature vector construction; genome clustering; genome representation; genome sequencing; misassembled region; read coverage; structural variation analysis; synteny analysis; unsupervised learning algorithm; Assembly; Bioinformatics; Clustering algorithms; Entropy; Genomics; Sequential analysis; Unsupervised learning; Backbone Torsion Potential; Local Interactions; Protein Loop Structure;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computational Advances in Bio and Medical Sciences (ICCABS), 2013 IEEE 3rd International Conference on
Conference_Location :
New Orleans, LA
Type :
conf
DOI :
10.1109/ICCABS.2013.6629196
Filename :
6629196