DocumentCode :
3499591
Title :
Optimistic bias in the assessment of high dimensional classifiers with a limited dataset
Author :
Chen, Weijie ; Brown, David G.
Author_Institution :
Food & Drug Adm., Silver Spring, MD, USA
fYear :
2011
fDate :
July 31 - Aug. 5, 2011
Firstpage :
2698
Lastpage :
2703
Abstract :
It is commonly recognized that using the same dataset for training and testing a classifier introduces optimistic bias in estimating classifier performance. However, bias of the same kind may still exist even when independent datasets are used for training and testing a classifier. This problem is especially important in the setting of high dimensional feature space and limited data. Bioinformatics data is typically characterized by a tremendous amount of data per patient but from a limited number of patients. Often the entire dataset is utilized in a “pre-training” stage during which the feature set is winnowed to a manageable number and the parameters of the training algorithm are established. Subsequently the data is split into training and test sets; however, bias has already been introduced into the classifier development process. We investigate the significance of this bias by performing simulated gene expression experiments. We find that, for data with moderate intrinsic separability and modest sample size, any observed separation is due to selection bias introduced in the aforementioned pre-training process. For greater intrinsic separability, correct data hygiene, i.e., complete separation of development and validation data, yields a positive result, but one far less impressive than that mistakenly obtained using incomplete data separation.
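The selection bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experiment: it assumes pure-noise features (no intrinsic separability at all), a simple t-like score for feature selection, and a nearest-centroid classifier, all of which are illustrative choices. The function names and the parameters `n`, `p`, and `k` are hypothetical. It compares selecting features on the entire dataset before splitting ("leaky") with selecting them on the training half only ("clean").

```python
import numpy as np

def top_features(X, y, k):
    """Indices of the k features with the largest absolute t-like score."""
    a, b = X[y == 0], X[y == 1]
    diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.argsort(-np.abs(diff / se))[:k]

def centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Test accuracy of a nearest-centroid classifier fitted on the training set."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = ((X_te - c0) ** 2).sum(axis=1)
    d1 = ((X_te - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_te).mean())

def simulate(n_trials=100, n=40, p=1000, k=10, seed=0):
    """Return mean test accuracy for (leaky, clean) feature selection on noise."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)
    half = n // 4  # training samples per class
    leaky, clean = [], []
    for _ in range(n_trials):
        # Assumption: features are pure noise, so true accuracy is 0.5.
        X = rng.standard_normal((n, p))
        # Stratified split: half of each class for training, half for testing.
        i0 = rng.permutation(np.flatnonzero(y == 0))
        i1 = rng.permutation(np.flatnonzero(y == 1))
        tr = np.concatenate([i0[:half], i1[:half]])
        te = np.concatenate([i0[half:], i1[half:]])
        # Leaky: features chosen using ALL samples, test samples included.
        f_leaky = top_features(X, y, k)
        leaky.append(centroid_accuracy(X[tr][:, f_leaky], y[tr],
                                       X[te][:, f_leaky], y[te]))
        # Clean: features chosen using the training samples only.
        f_clean = top_features(X[tr], y[tr], k)
        clean.append(centroid_accuracy(X[tr][:, f_clean], y[tr],
                                       X[te][:, f_clean], y[te]))
    return float(np.mean(leaky)), float(np.mean(clean))

if __name__ == "__main__":
    leaky_acc, clean_acc = simulate()
    print(f"leaky selection: {leaky_acc:.3f}, clean selection: {clean_acc:.3f}")
```

Under these assumptions the clean protocol should stay near chance, while the leaky protocol reports accuracy well above chance even though the features carry no class information.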
Keywords :
bioinformatics; genetics; pattern classification; bioinformatics data; classifier development process; gene expression; high dimensional classifier; optimistic bias; training algorithm; Breast cancer; Classification algorithms; Covariance matrix; Measurement; Signal to noise ratio; Testing; Training;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Neural Networks (IJCNN), The 2011 International Joint Conference on
Conference_Location :
San Jose, CA
ISSN :
2161-4393
Print_ISBN :
978-1-4244-9635-8
Type :
conf
DOI :
10.1109/IJCNN.2011.6033572
Filename :
6033572