DocumentCode
2207635
Title
An Extensive Empirical Study on Semi-supervised Learning
Author
Guo, Yuanyuan ; Niu, Xiaoda ; Zhang, Harry
Author_Institution
Fac. of Comput. Sci., Univ. of New Brunswick, Fredericton, NB, Canada
fYear
2010
fDate
13-17 Dec. 2010
Firstpage
186
Lastpage
195
Abstract
Semi-supervised classification methods use unlabeled data to help learn better classifiers when only a small amount of labeled data is available. Many semi-supervised learning methods have been proposed in the past decade. However, some questions have not been well answered, e.g., whether semi-supervised learning methods outperform base classifiers learned only from the labeled data when different base classifiers are used, whether carefully selecting unlabeled data is superior to random selection, and how the quality of the learned classifier changes at each iteration of the learning process. This paper conducts an extensive empirical study on the performance of several commonly used semi-supervised learning methods when different Bayesian classifiers (NB, NBTree, TAN, HGC, HNB, and DNB) are used as the base classifier. Results for Transductive SVM and LLGC, a graph-based semi-supervised learning method, are also studied for comparison. The experimental results on 26 UCI datasets and 6 widely used benchmark datasets show that these semi-supervised learning methods generally do not obtain better performance than classifiers learned only from the labeled data. Moreover, for standard self-training and co-training, selecting the most confident unlabeled instances during the learning process does not necessarily perform better than selecting unlabeled instances at random. We also observed interesting behavior in the learning curves of self-training with NB on some UCI datasets: the accuracy of the learned classifier on the testing set may fluctuate or decrease as more unlabeled instances are used. Furthermore, on the mushroom dataset, even when all the selected unlabeled instances are correctly labeled in each iteration, the accuracy on the testing set still goes down.
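The standard self-training loop studied in the abstract (train on labeled data, label the most confident unlabeled instances, add them to the labeled set, retrain) can be sketched as follows. This is a minimal illustration using a hand-rolled one-dimensional Gaussian naive Bayes, not the authors' implementation; the data, the selection size `k`, and the iteration count are arbitrary assumptions.

```python
import math

def fit_gnb(X, y):
    # Per-class mean, variance, and prior for a 1-D Gaussian naive Bayes.
    stats = {}
    for c in set(y):
        xs = [x for x, yy in zip(X, y) if yy == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-6  # smoothed
        stats[c] = (mu, var, len(xs) / len(y))
    return stats

def predict_proba(stats, x):
    # Posterior over classes via Bayes' rule with Gaussian likelihoods.
    scores = {}
    for c, (mu, var, prior) in stats.items():
        lik = math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        scores[c] = prior * lik
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def self_train(X_l, y_l, X_u, k=2, iters=5):
    # Standard self-training: in each iteration, move the k most
    # confidently predicted unlabeled instances into the labeled set.
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    for _ in range(iters):
        if not X_u:
            break
        model = fit_gnb(X_l, y_l)
        ranked = sorted(X_u,
                        key=lambda x: max(predict_proba(model, x).values()),
                        reverse=True)
        for x in ranked[:k]:
            probs = predict_proba(model, x)
            y_hat = max(probs, key=probs.get)  # pseudo-label
            X_l.append(x)
            y_l.append(y_hat)
            X_u.remove(x)
    return fit_gnb(X_l, y_l)

# Toy usage: two well-separated classes on the real line.
model = self_train([0.0, 0.2, 5.0, 5.2], [0, 0, 1, 1],
                   [0.1, 0.3, 4.9, 5.1])
probs = predict_proba(model, 4.8)
print(max(probs, key=probs.get))  # -> 1
```

Replacing the confidence-ranked selection in `ranked[:k]` with a random draw from `X_u` gives the random-selection baseline that the paper compares against.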
Keywords
Bayes methods; graphs; learning (artificial intelligence); pattern classification; Bayesian classifier; LLGC; UCI dataset; base classifier; benchmark dataset; graph based semisupervised learning method; learning curve; learning process; mushroom dataset; random selection; transductive SVM; unlabeled data; unlabeled instance; Bayesian classifiers; Semi-supervised learning;
fLanguage
English
Publisher
ieee
Conference_Titel
Data Mining (ICDM), 2010 IEEE 10th International Conference on
Conference_Location
Sydney, NSW
ISSN
1550-4786
Print_ISBN
978-1-4244-9131-5
Electronic_ISBN
1550-4786
Type
conf
DOI
10.1109/ICDM.2010.66
Filename
5693972
Link To Document