• DocumentCode
    2207635
  • Title
    An Extensive Empirical Study on Semi-supervised Learning

  • Author
    Guo, Yuanyuan ; Niu, Xiaoda ; Zhang, Harry

  • Author_Institution
    Fac. of Comput. Sci., Univ. of New Brunswick, Fredericton, NB, Canada
  • fYear
    2010
  • fDate
    13-17 Dec. 2010
  • Firstpage
    186
  • Lastpage
    195
  • Abstract
    Semi-supervised classification methods use unlabeled data to help learn better classifiers when only a small amount of labeled data is available. Many semi-supervised learning methods have been proposed in the past decade, but several questions remain open: whether semi-supervised learning methods outperform base classifiers learned only from the labeled data when different base classifiers are used, whether carefully selecting unlabeled data is superior to random selection, and how the quality of the learned classifier changes at each iteration of the learning process. This paper conducts an extensive empirical study of the performance of several commonly used semi-supervised learning methods when different Bayesian classifiers (NB, NBTree, TAN, HGC, HNB, and DNB) are used as the base classifier. Results for Transductive SVM and the graph-based semi-supervised learning method LLGC are also studied for comparison. The experimental results on 26 UCI datasets and 6 widely used benchmark datasets show that these semi-supervised learning methods generally do not obtain better performance than classifiers learned only from the labeled data. Moreover, for standard self-training and co-training, selecting the most confident unlabeled instances during the learning process does not necessarily perform better than selecting unlabeled instances at random. We also observed interesting outcomes when drawing learning curves for NB in self-training on some UCI datasets: the accuracy of the learned classifier on the testing set may fluctuate or decrease as more unlabeled instances are used, and on the mushroom dataset the accuracy on the testing set still goes down even when all the selected unlabeled instances are correctly labeled in each iteration.
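    The standard self-training scheme evaluated in the abstract (train a base classifier on the labeled data, label the most confident unlabeled instances, add them to the labeled set, and retrain) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes scikit-learn's GaussianNB as the base classifier, and the function name `self_train` and parameters `n_iter`/`k` are placeholders chosen here.

    ```python
    # Sketch of standard self-training with a Naive Bayes base classifier.
    # Assumptions (not from the paper): GaussianNB as the base learner,
    # a fixed number of iterations, and k instances added per iteration.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def self_train(X_labeled, y_labeled, X_unlabeled, n_iter=10, k=5):
        """Iteratively add the k most confident unlabeled instances."""
        X_l, y_l = X_labeled.copy(), y_labeled.copy()
        X_u = X_unlabeled.copy()
        clf = GaussianNB().fit(X_l, y_l)
        for _ in range(n_iter):
            if len(X_u) == 0:
                break
            proba = clf.predict_proba(X_u)
            conf = proba.max(axis=1)            # confidence of predicted label
            top = np.argsort(conf)[-k:]         # indices of k most confident
            y_new = clf.classes_[proba[top].argmax(axis=1)]
            X_l = np.vstack([X_l, X_u[top]])    # grow the labeled set
            y_l = np.concatenate([y_l, y_new])
            X_u = np.delete(X_u, top, axis=0)   # remove them from the pool
            clf = GaussianNB().fit(X_l, y_l)    # retrain on the enlarged set
        return clf
    ```

    Replacing the `top = np.argsort(conf)[-k:]` line with a random choice of k indices gives the random-selection baseline the paper compares against.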
  • Keywords
    Bayes methods; graphs; learning (artificial intelligence); pattern classification; Bayesian classifier; LLGC; UCI dataset; base classifier; benchmark dataset; graph based semisupervised learning method; learning curve; learning process; mushroom dataset; random selection; transductive SVM; unlabeled data; unlabeled instance; Bayesian classifiers; Semi-supervised learning;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    2010 IEEE 10th International Conference on Data Mining (ICDM)
  • Conference_Location
    Sydney, NSW
  • ISSN
    1550-4786
  • Print_ISBN
    978-1-4244-9131-5
  • Electronic_ISBN
    1550-4786
  • Type
    conf
  • DOI
    10.1109/ICDM.2010.66
  • Filename
    5693972