Title :
Why are neural networks sometimes much more accurate than decision trees: an analysis on a bio-informatics problem
Author :
Hall, Lawrence O. ; Liu, Xiaomei ; Bowyer, Kevin W. ; Banfield, Robert
Author_Institution :
Dept. of C. S. & E, Univ. of South Florida, Tampa, FL, USA
Abstract :
Bio-informatics data sets may be large in the number of examples and/or the number of features. Predicting the secondary structure of proteins from amino acid sequences is one example of high dimensional data for which large training sets exist. The data from the KDD Cup 2001 on the binding of compounds to thrombin is another example of a very high dimensional data set. This type of data set can require significant computing resources to train a neural network. In general, decision trees will require much less training time than neural networks. There have been a number of studies on the advantages of decision trees relative to neural networks for specific data sets. There are often statistically significant, though typically not very large, differences. Here, we examine one case in which a neural network greatly outperforms a decision tree: predicting the secondary structure of proteins. The hypothesis that the neural network learns important features of the data through its hidden units is explored by using a neural network to transform data for decision tree training. Experiments show that this explains some of the performance difference, but not all. Ensembles of decision trees are compared with a single neural network. It is our conclusion that the problem of protein secondary structure prediction exhibits some characteristics that are fundamentally better exploited by a neural network model.
Keywords :
biology computing; data mining; decision trees; learning (artificial intelligence); neural nets; proteins; sequences; very large databases; KDD Cup 2001; amino acid sequences; bio-informatics data sets; bio-informatics problem; decision trees; high dimensional data; neural networks; protein secondary structure prediction; thrombin; training sets; training time; Amino acids; Bioinformatics; Computer networks; Computer science; Decision trees; Neural networks; Predictive models; Protein engineering; Testing; Training data;
Conference_Title :
Systems, Man and Cybernetics, 2003. IEEE International Conference on
Print_ISBN :
0-7803-7952-7
DOI :
10.1109/ICSMC.2003.1244318