Title :
A performance comparison of dimension reduction methods for molecular structure classification
Author :
Zhi-Shui Zhang ; Li-Li Cao ; Jun Zhang
Author_Institution :
Sch. of Electron. Eng. & Autom., Anhui Univ., Hefei, China
Abstract :
Mass spectrometry is a powerful tool in chemistry research. A primary aim of data mining in chemistry is to extract useful information from chemistry databases and then classify compounds using informative sample features. Because mass spectrometry data are high-dimensional and have few samples, building a model first requires extracting useful features, which are then used for analysis, for constructing mining models, and for selecting the best parameters. We focus on dimension reduction methods and their application to the analysis of mass spectra. In this paper, we apply several methods, namely Principal Component Analysis (PCA), Multidimensional Scaling (MDS), Isometric Mapping (ISOMAP), Laplacian Eigenmaps, t-Distributed Stochastic Neighbor Embedding (tSNE), and the Large Margin Nearest Neighbor classifier (LMNN), to reduce the dimension of mass spectra. Finally, the AdaBoost algorithm combined with Classification and Regression Trees (AdaBoost-CART) is used to train a classifier that predicts 11 substructures from the reduced mass spectral feature set. The results demonstrate that LMNN yields a more useful low-dimensional dataset, which improves classification accuracy on mass spectral data.
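The two-stage pipeline described in the abstract (reduce the dimension, then boost tree classifiers on the reduced features) can be sketched roughly as follows. This is only an illustrative sketch under several assumptions, not the authors' implementation: PCA stands in for the six methods compared, depth-1 trees (decision stumps) stand in for full CART, the data are synthetic Gaussian vectors rather than mass spectra, and a single binary label stands in for the presence or absence of one substructure. All function names (`make_data`, `pca_project`, `fit_stump`, `adaboost_fit`, `adaboost_predict`) are hypothetical.

```python
import math
import random

def make_data(n=40, d=20, seed=1):
    """Synthetic stand-in for spectra: few samples, many features, labels in {+1, -1}."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        row = [rng.gauss(0, 1) for _ in range(d)]
        for j in range(5):              # only the first 5 features carry class signal
            row[j] += 3.0 * label
        X.append(row)
        y.append(label)
    return X, y

def pca_project(X, k, iters=200):
    """Project rows of X onto the top-k principal components (power iteration + deflation)."""
    n, d = len(X), len(X[0])
    mean = [sum(r[j] for r in X) / n for j in range(d)]
    Xc = [[r[j] - mean[j] for j in range(d)] for r in X]
    # Sample covariance matrix (d x d)
    C = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(d)] for a in range(d)]
    rng = random.Random(0)
    comps = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(d)) for a in range(d))
        comps.append(v)
        # Remove the component just found so the next pass finds the next one
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in comps] for r in Xc]

def stump_predict(stump, z):
    _, j, t, pol = stump
    return pol if z[j] > t else -pol

def fit_stump(Z, y, w):
    """Weighted depth-1 tree: pick (feature, threshold, polarity) with lowest weighted error."""
    best = (float("inf"), 0, 0.0, 1)
    for j in range(len(Z[0])):
        for t in sorted({z[j] for z in Z}):
            for pol in (1, -1):
                cand = (0.0, j, t, pol)
                err = sum(wi for z, yi, wi in zip(Z, y, w) if stump_predict(cand, z) != yi)
                if err < best[0]:
                    best = (err, j, t, pol)
    return best

def adaboost_fit(Z, y, rounds=10):
    """Discrete AdaBoost: reweight samples so later stumps focus on earlier mistakes."""
    n = len(Z)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = fit_stump(Z, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # clamp to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, z))
             for z, yi, wi in zip(Z, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, z):
    score = sum(alpha * stump_predict(s, z) for alpha, s in model)
    return 1 if score >= 0 else -1

X, y = make_data()
Z = pca_project(X, k=2)                 # 20 features reduced to 2
model = adaboost_fit(Z, y, rounds=10)
acc = sum(adaboost_predict(model, z) == yi for z, yi in zip(Z, y)) / len(y)
print(f"training accuracy on 2 PCA features: {acc:.2f}")
```

The accuracy printed here is training accuracy on easily separable toy data, so it says nothing about the paper's results. A fair comparison of reduction methods would swap `pca_project` for each alternative and evaluate on held-out samples.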
Keywords :
chemistry computing; data analysis; data mining; eigenvalues and eigenfunctions; learning (artificial intelligence); mass spectra; pattern classification; principal component analysis; regression analysis; stochastic processes; AdaBoost algorithm; AdaBoost-CART; ISOMAP; LMNN; Laplacian eigenmaps; MDS; PCA; chemistry databases; classification and regression tree; isometric mapping; large margin NN classifier; mass spectra data analysis; mass spectra dimension reduction methods; mass spectrometry; molecular structure classification; multidimensional scaling; t-distributed stochastic neighbor embedding; tSNE; Accuracy; Algorithm design and analysis; Chemicals; Classification algorithms; Laplace equations; Libraries; Classification; dimension reduction
Conference_Title :
Biomedical Engineering and Informatics (BMEI), 2014 7th International Conference on
Conference_Location :
Dalian
Print_ISBN :
978-1-4799-5837-5
DOI :
10.1109/BMEI.2014.7002890