Title :
A Partial Least Squares Based Procedure for Upstream Sequence Classification in Prokaryotes
Author :
Mehmood, Tahir ; Bohlin, Jon ; Snipen, Lars
Author_Institution :
Dept. of Chem., Biotechnol. & Food Sci., Norwegian Univ. of Life Sci., Akershous, Norway
Abstract :
The upstream region of coding genes is important for several reasons, for instance for locating transcription factor binding sites and start sites of initiation in genomic DNA. Motivated by a recent study in which a multivariate approach was successfully applied to coding sequence modeling, we introduce a partial least squares (PLS) based procedure for classifying true upstream prokaryotic sequences against background upstream sequences. The upstream sequences of coding genes conserved across genomes were considered in the analysis, where the conserved coding genes were found by using the pan-genomics concept for each considered prokaryotic species. PLS uses a position-specific scoring matrix (PSSM) to study the characteristics of the upstream region. Results obtained by the PLS based method were compared with the Gini importance of random forest (RF) and with support vector machine (SVM), which is a widely used method for sequence classification. The upstream sequence classification performance was evaluated by cross validation, and the suggested approach identifies the prokaryotic upstream region significantly better than RF (p-value < 0.01) and SVM (p-value < 0.01). Further, the proposed method also produced results that concurred with known biological characteristics of the upstream region.
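The abstract names a position-specific scoring matrix (PSSM) as the feature representation but does not say how it is built. As a rough illustration only, the following sketch builds a log-odds PSSM from equal-length sequences and scores a candidate against it. The function names `build_pssm` and `score`, the pseudocount, the uniform background frequency, and the toy sequences are all assumptions made for this example; none are taken from the paper, and the PLS, RF, and SVM steps are not shown.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sequences, pseudocount=1.0, background=0.25):
    """Build a log2-odds PSSM: one dict of base -> score per position.

    Assumes equal-length sequences containing only A, C, G, T
    (any other symbol raises KeyError). The pseudocount and the
    uniform background are illustrative choices, not the paper's.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    pssm = []
    for pos in range(length):
        # Start every base at the pseudocount so no frequency is zero.
        counts = {base: pseudocount for base in ALPHABET}
        for seq in sequences:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        pssm.append(
            {b: math.log2((counts[b] / total) / background) for b in ALPHABET}
        )
    return pssm


def score(pssm, sequence):
    """Sum the per-position log-odds scores of a sequence."""
    return sum(column[base] for column, base in zip(pssm, sequence))


# Toy data, invented for illustration (not from the paper).
true_upstream = ["AGGAGG", "AGGAGA", "AGGTGG", "AGGAGG"]
pssm = build_pssm(true_upstream)
motif_score = score(pssm, "AGGAGG")       # resembles the training set
background_score = score(pssm, "CTTCTC")  # shares no base with it
```

In this toy setup a sequence resembling the training set scores above zero and an unrelated one scores below zero. A classifier such as PLS would take per-position features of this kind as input rather than a single summed score.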
Keywords :
DNA; bioinformatics; cellular biophysics; classification; genetics; genomics; least squares approximations; molecular biophysics; molecular configurations; support vector machines; Gini importance; PLS; PSSM; RF; SVM; background upstream sequence; binding sites; coding genes; coding sequence modeling; genomic DNA; pan-genomics concept; partial least squares; position specific scoring matrix; prokaryotes; random forest; site initiation; support vector machine; transcription factor; true upstream prokaryotic sequence; upstream sequence classification; Bioinformatics; Encoding; Genomics; Radio frequency; Strain; Support vector machines; Vectors; Partial Least Squares; Partial least squares; classification; prokaryotes;
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
DOI :
10.1109/TCBB.2014.2366146