DocumentCode
2207635
Title
An Extensive Empirical Study on Semi-supervised Learning
Author
Guo, Yuanyuan ; Niu, Xiaoda ; Zhang, Harry
Author_Institution
Fac. of Comput. Sci., Univ. of New Brunswick, Fredericton, NB, Canada
fYear
2010
fDate
13-17 Dec. 2010
Firstpage
186
Lastpage
195
Abstract
Semi-supervised classification methods use unlabeled data to help learn better classifiers when only a small amount of labeled data is available. Many semi-supervised learning methods have been proposed in the past decade. However, some questions have not been well answered, e.g., whether semi-supervised learning methods outperform base classifiers learned only from the labeled data when different base classifiers are used, whether carefully selecting unlabeled data is superior to random selection, and how the quality of the learned classifier changes at each iteration of the learning process. This paper conducts an extensive empirical study on the performance of several commonly used semi-supervised learning methods when different Bayesian classifiers (NB, NBTree, TAN, HGC, HNB, and DNB) are used as the base classifier. Results for Transductive SVM and LLGC, a graph-based semi-supervised learning method, are also studied for comparison. The experimental results on 26 UCI datasets and 6 widely used benchmark datasets show that these semi-supervised learning methods generally do not obtain better performance than classifiers learned only from the labeled data. Moreover, for standard self-training and co-training, selecting the most confident unlabeled instances during the learning process does not necessarily perform better than selecting unlabeled instances at random. We also observed interesting behavior in the learning curves of self-training with NB on some UCI datasets: the accuracy of the learned classifier on the testing set may fluctuate or decrease as more unlabeled instances are used. Furthermore, on the mushroom dataset, even when all the selected unlabeled instances are correctly labeled in each iteration, the accuracy on the testing set still goes down.
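The standard self-training loop studied in the abstract (train on labeled data, label the most confident unlabeled instances, add them to the labeled set, retrain) can be sketched as follows. This is a minimal illustration using a hand-rolled one-dimensional Gaussian naive Bayes, not the authors' implementation; the data, the selection size `k`, and the iteration count are arbitrary assumptions.

```python
import math

def fit_gnb(X, y):
    # Per-class mean, variance, and prior for a 1-D Gaussian naive Bayes.
    stats = {}
    for c in set(y):
        xs = [x for x, yy in zip(X, y) if yy == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-6  # smoothed
        stats[c] = (mu, var, len(xs) / len(y))
    return stats

def predict_proba(stats, x):
    # Posterior over classes via Bayes' rule with Gaussian likelihoods.
    scores = {}
    for c, (mu, var, prior) in stats.items():
        lik = math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        scores[c] = prior * lik
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def self_train(X_l, y_l, X_u, k=2, iters=5):
    # Standard self-training: in each iteration, move the k most
    # confidently predicted unlabeled instances into the labeled set.
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    for _ in range(iters):
        if not X_u:
            break
        model = fit_gnb(X_l, y_l)
        ranked = sorted(X_u,
                        key=lambda x: max(predict_proba(model, x).values()),
                        reverse=True)
        for x in ranked[:k]:
            probs = predict_proba(model, x)
            y_hat = max(probs, key=probs.get)  # pseudo-label
            X_l.append(x)
            y_l.append(y_hat)
            X_u.remove(x)
    return fit_gnb(X_l, y_l)

# Toy usage: two well-separated classes on the real line.
model = self_train([0.0, 0.2, 5.0, 5.2], [0, 0, 1, 1],
                   [0.1, 0.3, 4.9, 5.1])
probs = predict_proba(model, 4.8)
print(max(probs, key=probs.get))  # -> 1
```

Replacing the confidence-ranked selection in `ranked[:k]` with a random draw from `X_u` gives the random-selection baseline that the paper compares against.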
Keywords
Bayes methods; graphs; learning (artificial intelligence); pattern classification; Bayesian classifier; LLGC; UCI dataset; base classifier; benchmark dataset; graph based semisupervised learning method; learning curve; learning process; mushroom dataset; random selection; transductive SVM; unlabeled data; unlabeled instance; Bayesian classifiers; Semi-supervised learning;
fLanguage
English
Publisher
ieee
Conference_Titel
Data Mining (ICDM), 2010 IEEE 10th International Conference on
Conference_Location
Sydney, NSW
ISSN
1550-4786
Print_ISBN
978-1-4244-9131-5
Electronic_ISBN
1550-4786
Type
conf
DOI
10.1109/ICDM.2010.66
Filename
5693972
Link To Document