Title :
Noise identification with the k-means algorithm
Author :
Tang, Wei ; Khoshgoftaar, Taghi M.
Author_Institution :
Dept. of Comput. Eng., Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
The presence of noise in a measurement dataset can have a negative effect on the classification model built from it. More specifically, the noisy instances in the dataset can adversely affect the learnt hypothesis. Removing noisy instances improves the learnt hypothesis and thus the classification accuracy of the model. A clustering-based noise detection approach using the k-means algorithm is presented. We present a new metric, the noise factor, that measures how likely an instance is to be noisy. Based on the computed noise factor values of the instances, the clustering-based algorithm is then used to identify and eliminate p% of the instances in the dataset. These instances are considered the most likely to be noisy among the instances in the dataset; the value of p is varied from 1% to 40%. The noise detection approach is investigated with respect to two case studies of software measurement data obtained from NASA software projects. The two datasets are characterized by the same thirteen software metrics and a class label that classifies the program modules as either fault-prone or not fault-prone. It is shown that as more noisy instances are removed, the classification accuracy of the C4.5 learner improves. This indicates that the removed instances are most likely noisy instances that contributed to poor classification accuracy.
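The pipeline described in the abstract (cluster with k-means, score each instance, remove the top p%) might be sketched as follows. This is a minimal illustration, not the authors' method: the abstract does not define the noise factor, so the score used here (distance to the nearest centroid) is an assumed stand-in, and the function names `kmeans`, `noise_factors`, and `remove_top_p` are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of numeric tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def noise_factors(points, centroids):
    """ASSUMPTION: stand-in score, not the paper's metric.

    Each instance is scored by its Euclidean distance to the nearest
    centroid, so instances far from every cluster get the highest scores.
    """
    return [min(math.dist(p, c) for c in centroids) for p in points]


def remove_top_p(points, scores, p):
    """Drop the p% of instances with the highest scores.

    Returns (kept_points, sorted indices of the removed instances).
    """
    n_remove = int(len(points) * p / 100)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_remove])
    kept = [pt for i, pt in enumerate(points) if i not in dropped]
    return kept, sorted(dropped)
```

A caller would run `kmeans` on the metric vectors, pass the centroids to `noise_factors`, and call `remove_top_p` with p anywhere from 1 to 40 before training a classifier on the kept instances. With random initialization an outlier can end up as its own centroid and score zero, which is one reason a distance-only score is a simplification.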
Keywords :
data mining; noise measurement; pattern classification; pattern clustering; software metrics; software quality; very large databases; NASA software project; classification model; clustering-based noise detection; fault-prone; k-means algorithm; noise identification; noisy instance; software measurement dataset; software metric; Clustering algorithms; Data mining; Filters; Machine learning; NASA; Noise measurement; Predictive models; Software measurement; Software metrics; Software quality;
Conference_Title :
Tools with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE International Conference on
Print_ISBN :
0-7695-2236-X
DOI :
10.1109/ICTAI.2004.93