DocumentCode :
3239675
Title :
A mixture model framework for class discovery and outlier detection in mixed labeled/unlabeled data sets
Author :
Miller, David J. ; Browning, John
Author_Institution :
Dept. of Electr. Eng., Pennsylvania State Univ., University Park, PA, USA
fYear :
2003
fDate :
17-19 Sept. 2003
Firstpage :
489
Lastpage :
498
Abstract :
Several authors have addressed learning a classifier given a mixed labeled/unlabeled training set. These works assume that each unlabeled sample originates from one of the (known) classes. This work considers the scenario in which unlabeled points may belong either to known/predefined classes or to heretofore undiscovered classes. There are several practical situations where such data may arise. We earlier proposed a novel statistical mixture model to fit this mixed data. In this paper we review that method and introduce an alternative model. Our fundamental strategy is to view as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each point. Two types of mixture components are used to explain label presence/absence. "Predefined" components generate both labeled and unlabeled points and assume that labels are missing at random. These components represent the known classes. "Non-predefined" components generate only unlabeled points. In localized regions, they therefore capture data subsets that are exclusively unlabeled. Such subsets may represent an outlier distribution or new classes. The components' predefined/non-predefined natures are data-driven, learned along with the other parameters via an algorithm based on expectation-maximization (EM). Three natural applications are presented: 1) robust classifier design, given a mixed training set with outliers; 2) classification with rejections; and 3) identification of the unlabeled points (and their representative components) that originate from unknown classes, i.e., new class discovery. The effectiveness of our models in discovering purely unlabeled data components (potential new classes) is evaluated on both synthetic and real data sets. Although each of our models has its own advantages, the original model is found to achieve the best class discovery results.
Keywords :
data handling; learning (artificial intelligence); optimisation; set theory; signal detection; expectation-maximization; labeled/unlabeled data sets; outlier detection; real data sets; robust classifier design; statistical mixture model; Character recognition; Databases; Humans; Internet; Labeling; Maximum likelihood estimation; Parameter estimation; Remote sensing; Robustness; Uncertainty;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Neural Networks for Signal Processing, 2003. NNSP'03. 2003 IEEE 13th Workshop on
ISSN :
1089-3555
Print_ISBN :
0-7803-8177-7
Type :
conf
DOI :
10.1109/NNSP.2003.1318048
Filename :
1318048