DocumentCode :
2191196
Title :
Comparative analysis of machine learning techniques for the prediction of logP
Author :
Lowe, Edward W., Jr. ; Butkiewicz, Mariusz ; Spellings, Matthew ; Omlor, Albert ; Meiler, Jens
Author_Institution :
Center for Struct. Biol., Vanderbilt Univ., Nashville, TN, USA
fYear :
2011
fDate :
11-15 April 2011
Firstpage :
1
Lastpage :
6
Abstract :
Several machine learning techniques were evaluated for the prediction of logP. The algorithms used include artificial neural networks (ANN), support vector machines (SVM) with the extension for regression, and k-nearest neighbors (k-NN). Molecules were described using optimized feature sets derived from a series of scalar, two-dimensional, and three-dimensional descriptors, including 2-D and 3-D autocorrelation and radial distribution functions. Feature optimization was performed by sequential forward feature selection. The data set contained over 25,000 molecules with experimentally determined logP values collected from the Reaxys and MDDR databases, as well as through data mining in SciFinder. LogP, the logarithm of the equilibrium octanol-water partition coefficient for a given substance, is a measure of hydrophobicity. This property is an important metric for drug absorption, distribution, metabolism, and excretion (ADME). In this work, models were built by systematically optimizing feature sets and algorithmic parameters; the resulting models predict logP with a root mean square deviation (rmsd) of 0.86 for compounds in an independent test set. This result represents a substantial improvement over XlogP, an incremental system that achieves an rmsd of 1.41 on the same data set. The final models were 5-fold cross-validated. These fully in silico models can be useful in guiding early stages of drug discovery, such as virtual library screening and analogue prioritization prior to synthesis and biological testing. The models are freely available for academic use.
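The procedure named in the abstract (sequential forward feature selection scored by a cross-validated rmsd) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`rmsd`, `knn_predict`, `cv_rmsd`, `forward_select`), the choice of a plain k-NN regressor with k = 3, the interleaved fold assignment, and the synthetic data are all assumptions made for illustration, and no real descriptors or logP values are used.

```python
import math
import random


def rmsd(predicted, observed):
    """Root mean square deviation between two equal-length sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


def knn_predict(train_x, train_y, query, k, features):
    """Average the targets of the k training rows closest to `query`,
    using squared Euclidean distance over the chosen feature columns only."""
    ranked = sorted(
        range(len(train_x)),
        key=lambda i: sum((train_x[i][f] - query[f]) ** 2 for f in features),
    )
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_rmsd(x, y, features, k=3, folds=5):
    """rmsd over held-out predictions pooled across `folds` interleaved folds."""
    predicted, observed = [], []
    for fold in range(folds):
        train = [i for i in range(len(x)) if i % folds != fold]
        test = [i for i in range(len(x)) if i % folds == fold]
        train_x = [x[i] for i in train]
        train_y = [y[i] for i in train]
        for i in test:
            predicted.append(knn_predict(train_x, train_y, x[i], k, features))
            observed.append(y[i])
    return rmsd(predicted, observed)


def forward_select(x, y, k=3, folds=5):
    """Greedy sequential forward selection: repeatedly add the single feature
    that lowers the cross-validated rmsd the most; stop when none improves it."""
    remaining = set(range(len(x[0])))
    selected, best = [], float("inf")
    while remaining:
        scores = {f: cv_rmsd(x, y, selected + [f], k, folds) for f in sorted(remaining)}
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected, best


# Synthetic stand-in data (an assumption, not real descriptors or logP values):
# column 0 determines the target, columns 1 and 2 are uninformative noise.
rng = random.Random(0)
x = [[rng.uniform(-3, 3) for _ in range(3)] for _ in range(60)]
y = [2.0 * row[0] for row in x]

selected, score = forward_select(x, y)
print("selected feature columns:", selected)
print("cross-validated rmsd:", round(score, 3))
```

On this toy data the informative column is picked first, because a noise column alone cannot bring the rmsd below the spread of the target. The same loop would apply unchanged to an ANN or SVM regressor by swapping out `knn_predict`.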
Keywords :
biological techniques; data mining; drugs; feature extraction; hydrophobicity; learning (artificial intelligence); optimisation; parallel processing; radial basis function networks; regression analysis; 2D autocorrelation; 3D autocorrelation; ANN; LogP prediction; MDDR database; Reaxys database; SVM; SciFinder; XlogP; artificial neural networks; biological testing; drug absorption; drug discovery; drug distribution; drug excretion; drug metabolism; feature optimization; kappa nearest neighbor method; machine learning technique; metabolism; octanol-water partition coefficient; optimized feature sets; radial distribution function; sequential forward feature selection; support vector machines; three-dimensional descriptor; two-dimensional descriptor; Compounds; Machine learning; Predictive models; Training
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2011 IEEE Symposium on
Conference_Location :
Paris
Print_ISBN :
978-1-4244-9896-3
Type :
conf
DOI :
10.1109/CIBCB.2011.5948478
Filename :
5948478