Regional Information Center for Science and Technology - Performance analysis for machine-learning experiments using small data sets

Abstract:

Machine-learning techniques are increasingly used to deal with a variety of problems in agriculture. However, challenges in applying machine learning remain, such as analyzing the performance achieved through learning from small data sets. This study focused on using graphical and statistical techniques to analyze the results of machine-learning experiments involving data preprocessing and algorithm tuning. The data consisted of 1428 cases that were classified by a dairy-nutrition specialist as outliers (34 cases) or non-outliers. The performance of classifiers, generated with decision-tree induction, was estimated through ten-fold cross validation. Relative operating characteristic (ROC) curves were used to visualize the achieved trade-offs between correctly classifying positive and negative cases. A performance index, representing the mean true positive rate of these curves over a limited range of false positive rate values, was developed to facilitate comparison among classification schemes. Analysis of variance (ANOVA) was used to determine whether real differences in expected performance on new data existed among the combinations of data preprocessing and algorithm configurations evaluated in this study. In terms of data preprocessing, randomly assigning herds to the folds of the cross validation did not perform significantly differently from assigning cases to folds, while using a special value to indicate irrelevant attribute values significantly improved performance over treating these values as unknown. Tuning the configuration of the decision-tree induction algorithm significantly improved classification performance. The application of ten-fold cross validation in combination with ROC curves and ANOVA was found to be useful in analyzing the results of machine-learning experiments involving decision-tree induction and small data sets.
These methods could also be used with other machine-learning techniques such as artificial neural networks and instance-based learning.
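
The performance index described in the abstract, a mean true positive rate over a limited range of false positive rates, can be illustrated with a short sketch. The abstract does not give the exact formula or the range used, so the function names `roc_points` and `mean_tpr`, the linear interpolation between ROC points, and the bounds `fpr_lo` and `fpr_hi` are all assumptions made for illustration. Scores are assumed to be any per-case ranking value, for example a class-probability estimate from a decision-tree leaf.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr) ordered by non-decreasing FPR.

    labels: 1 for positive (e.g. outlier), 0 for negative.
    scores: higher means "more likely positive" (assumed ranking value).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Rank cases by decreasing score and lower the threshold step by step.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Cases with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def mean_tpr(points, fpr_lo, fpr_hi):
    """Mean TPR for FPR in [fpr_lo, fpr_hi] (hypothetical bounds).

    Computed as the area under the piecewise-linear ROC curve within
    the range, divided by the width of the range.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 == x0:
            continue  # vertical segment contributes no area
        a, b = max(x0, fpr_lo), min(x1, fpr_hi)
        if b <= a:
            continue  # segment lies outside the range of interest
        slope = (y1 - y0) / (x1 - x0)
        ya = y0 + slope * (a - x0)
        yb = y0 + slope * (b - x0)
        area += 0.5 * (ya + yb) * (b - a)  # trapezoid on the clipped segment
    return area / (fpr_hi - fpr_lo)
```

Restricting the range to low false positive rates focuses the comparison on the operating region that matters when positives are rare, as with 34 outliers among 1428 cases. With the full range of 0 to 1, this quantity reduces to the ordinary area under the ROC curve.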
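
The two fold-assignment strategies compared in the abstract, assigning herds to folds versus assigning individual cases to folds, might be sketched as follows. This is a minimal illustration and not the study's procedure: the function names, the round-robin balancing, and the `seed` parameter are assumptions.

```python
import random


def assign_folds_by_group(group_ids, n_folds=10, seed=0):
    """Assign folds so that all cases from one group (herd) share a fold.

    group_ids: one herd identifier per case (hypothetical input format).
    Returns a list with one fold index per case.
    """
    rng = random.Random(seed)
    groups = sorted(set(group_ids))
    rng.shuffle(groups)
    # Deal the shuffled herds out to the folds in round-robin order.
    fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}
    return [fold_of_group[g] for g in group_ids]


def assign_folds_by_case(n_cases, n_folds=10, seed=0):
    """Assign individual cases to folds at random, ignoring herds."""
    rng = random.Random(seed)
    order = list(range(n_cases))
    rng.shuffle(order)
    folds = [0] * n_cases
    for rank, case_index in enumerate(order):
        folds[case_index] = rank % n_folds
    return folds
```

Herd-level assignment keeps cases from the same herd out of both the training and test sets of a given fold, which guards against optimistic estimates when cases within a herd resemble one another. Fold sizes can be uneven under this scheme if herds differ in size.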
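
The ANOVA step can be illustrated in its simplest form, a one-way F statistic computed over groups of performance-index values, one group per classification scheme. The abstract does not state the ANOVA design actually used, so the one-way layout and the function name `one_way_anova_f` are assumptions; a factorial design over preprocessing and tuning factors would need a fuller model.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, each holding the performance-index values
    observed for one scheme (hypothetical input layout).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

The resulting F value would be compared against the F distribution with the returned degrees of freedom to obtain a p-value, for instance with a statistics package.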