Author/Authors :
Jing, Xiao-Yang Inner Mongolia Agricultural University - Hohhot, China , Li, Feng-Min Inner Mongolia Agricultural University - Hohhot, China
Abstract :
Heat shock proteins (HSPs) are ubiquitous in living organisms. HSPs are essential for cell growth and survival; their
main function is to control the folding and unfolding of proteins. According to molecular function and mass,
HSPs are categorized into six families: HSP20 (small HSPs), HSP40 (J-proteins), HSP60, HSP70, HSP90, and HSP100.
In this paper, an improved method for HSP prediction is proposed: the split amino acid composition (SAAC), the dipeptide
composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were selected as features to
predict HSPs with a support vector machine (SVM). To overcome the imbalanced-data classification problem, the synthetic
minority oversampling technique (SMOTE) was used to balance the dataset. The overall accuracy was 99.72% with a balanced
dataset in the jackknife test using the optimized feature combination SAAC+DC+CTF+PseACS, which was 4.81% higher
than that obtained on the imbalanced dataset with the same feature combination. The sensitivity (Sn), specificity (Sp),
accuracy (Acc), and Matthews correlation coefficient (MCC) for the HSP families in our predictive
model were higher than those of existing methods. This improved method may be helpful for protein function prediction.
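
As an illustration of two ingredients named in the abstract, the following is a minimal Python sketch of a dipeptide composition (DC) feature vector and of the Sn, Sp, Acc, and MCC measures computed from one-vs-rest confusion counts. It is not the authors' implementation: the function names `dipeptide_composition` and `binary_metrics` are hypothetical, and normalizing by the number of valid residue pairs (and skipping pairs with non-standard residues) is an assumption, since the abstract does not give the exact definitions.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order (assumed feature layout).
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide frequencies (hypothetical helper)."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def binary_metrics(tp, tn, fp, fn):
    """Return (Sn, Sp, Acc, MCC) from one-vs-rest confusion counts."""
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc
```

In a full pipeline, vectors like these would be concatenated with the other features, balanced with SMOTE, and passed to an SVM; those steps are omitted here because the abstract does not specify their parameters.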