Title :
Identifying learners robust to low quality data
Author :
Folleco, Andres ; Khoshgoftaar, Taghi M. ; Van Hulse, Jason ; Bullard, Lofton
Author_Institution :
Florida Atlantic University, Boca Raton, USA
Abstract :
Real-world datasets commonly contain noise in both the independent and dependent variables. Noise, which typically consists of erroneous variable values, has been shown to significantly affect the classification performance of learners. In this study, we identify learners with robust performance in the presence of low-quality (noisy) measurement data. Noise was injected into five class-imbalanced software engineering measurement datasets that were initially relatively free of noise. The experimental factors considered were the learner used, the level of injected noise, the dataset used (each with unique properties), and the percentage of minority instances containing noise. We found no other related studies that identify learners that are robust in the presence of low-quality measurement data. Based on the results of this study, we recommend using the random forest learner for building classification models from noisy data.
Keywords :
Data mining; Decision trees; Machine learning; Noise level; Noise measurement; Noise robustness; Software measurement; Support vector machine classification; Support vector machines; Working environment noise; learning performance; quality of data; random forest; software measurement data
Conference_Title :
Information Reuse and Integration, 2008. IRI 2008. IEEE International Conference on
Conference_Location :
Las Vegas, NV, USA
Print_ISBN :
978-1-4244-2659-1
Electronic_ISBN :
978-1-4244-2660-7
DOI :
10.1109/IRI.2008.4583028