DocumentCode
3169048
Title
An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics
Author
Yoon, Kihoon ; Kwek, Stephen
Author_Institution
Dept. of Comput. Sci., Texas Univ., San Antonio, TX, USA
fYear
2005
fDate
6-9 Nov. 2005
Abstract
Learning from imbalanced data arises very frequently in functional genomics applications. A ratio of one positive example to thousands of negative instances is common in scientific applications. Unfortunately, traditional machine learning techniques treat the extremely rare minority instances as noise. The standard approach to this difficulty is to balance the training data by resampling. However, this results in a high rate of false positive predictions. Hence, we propose preprocessing the majority instances by partitioning them into clusters. This greatly reduces the ambiguity between the minority instances and the instances in each cluster. For moderately high imbalance ratios and low in-class complexity, our technique gives better prediction accuracy than the undersampling method. For extreme imbalance ratios, as in the splice site prediction problem, we demonstrate that this technique serves as a good filter with almost perfect recall, reducing the amount of imbalance so that traditional classification techniques can be deployed and yield significant improvements over previous predictors. We also show that the technique works for subcellular localization and post-translational modification site prediction problems.
Keywords
biology computing; genetics; pattern classification; unsupervised learning; data imbalance; functional genomics; machine learning; post-translational modification site prediction; subcellular localization; supervised learning; Application software; Bioinformatics; Computer science; Genomics; Proteins; Testing; Training data
fLanguage
English
Publisher
IEEE
Conference_Titel
Hybrid Intelligent Systems, 2005. HIS '05. Fifth International Conference on
Print_ISBN
0-7695-2457-5
Type
conf
DOI
10.1109/ICHIS.2005.23
Filename
1587765