Title :
Accurate Estimation of Generalization Performance for Active Learning
Author :
Aubrey Gress;Ian Davidson
Abstract :
Active learning is a crucial method in settings where human labels for instances are costly to obtain. The typical active learning loop builds a model from a few labeled instances, chooses informative unlabeled instances, asks an oracle (i.e., a human) to label them, and then rebuilds the model. Active learning is widely used, with much research attention focused on determining which instances to ask the human to label. However, an understudied problem is estimating the accuracy of the learner when instances are added actively. This is a problem because ordinary cross-validation may not work well due to the bias introduced by actively selecting which instances to label. We show that existing methods for estimating performance under this bias are not suitable for practitioners: the scaling coefficients can have high variance, the estimators can produce nonsensical results, and the estimates are empirically inaccurate in the classification setting. We propose a new general active learning method that more accurately estimates generalization performance through a sampling step and a new weighted cross-validation estimator. Our method can be used with a variety of query strategies and learners. We empirically illustrate the benefits of our method to the practitioner by showing that it is more accurate than the standard weighted cross-validation estimator and that, when used as part of a termination criterion, it obtains more accurate estimates of generalization error while achieving comparable generalization performance.
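The generic active learning loop described in the abstract (train, query an informative instance, ask the oracle, retrain) can be sketched as follows. This is a minimal illustration, not the paper's method: the 1-D threshold learner, the `oracle` stand-in, and uncertainty sampling as the query strategy are all assumptions made for the example.

```python
import random

def train(labeled):
    """Toy 1-D learner: threshold halfway between the two class means."""
    xs0 = [x for x, y in labeled if y == 0]
    xs1 = [x for x, y in labeled if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0

def query(threshold, pool):
    """Uncertainty sampling: the unlabeled point closest to the boundary."""
    return min(pool, key=lambda x: abs(x - threshold))

def oracle(x):
    """Stand-in for the human labeler (true boundary at 0.5)."""
    return int(x > 0.5)

random.seed(0)
pool = [random.random() for _ in range(100)]   # unlabeled pool
labeled = [(0.1, 0), (0.9, 1)]                 # a few seed labels

for _ in range(10):                 # the active learning loop
    model = train(labeled)          # build a model from labeled data
    x = query(model, pool)          # choose an informative instance
    pool.remove(x)
    labeled.append((x, oracle(x)))  # ask the oracle, then rebuild

final_model = train(labeled)
```

Because the queried points cluster near the decision boundary rather than being drawn uniformly, a held-out estimate computed on the labeled set is biased; this is the bias that weighted cross-validation estimators attempt to correct.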
Keywords :
"Training","Training data","Learning systems","Standards","Labeling","Logistics","Conferences"
Conference_Title :
2015 IEEE International Conference on Data Mining (ICDM)
DOI :
10.1109/ICDM.2015.137