DocumentCode :
1967469
Title :
The partitioning- and rule-based filter for noise detection
Author :
Xiao, Yudong ; Khoshgoftaar, Taghi M. ; Seliya, Naeem
Author_Institution :
Dept. of Electr. & Comput. Eng., Florida Atlantic Univ., Boca Raton, FL, USA
fYear :
2005
fDate :
15-17 Aug. 2005
Firstpage :
205
Lastpage :
210
Abstract :
Poor data quality is a problem across many domains. The amount of noise in a dataset is often a direct indicator of its quality. Data noise is generally divided into two categories: class noise (mislabeling errors) and attribute noise. Noise detection techniques proposed in the literature include the ensemble filter, the partitioning filter, and data polishing. However, several of these techniques lack adequate noise detection accuracy. In addition, they simply flag instances as noisy without indicating how noisy each instance is relative to the others. This paper proposes a novel approach to noise detection, the partitioning- and rule-based filter. The approach combines four mechanisms to achieve high accuracy in noise detection and to provide a relative noise-based ranking of instances: repeated data partitioning, inclusive evaluation, unweighted voting, and dual-two-class-classifiers. The proposed approach is evaluated using datasets obtained from the UCI data repository. Empirical studies with simulated (artificial) noise injected into clean or benchmark datasets demonstrate excellent noise detection performance; in many cases, perfect or near-perfect performance is observed. In addition, the proposed approach achieved significantly better detection rates for class noise than a proven existing approach, the partitioning filter.
Keywords :
data mining; database management systems; noise; UCI data repository; attribute noise; data polishing; data quality; dual-two-class-classifiers; ensemble filter; mislabeling error; noise detection; partitioning filter; relative noise-based ranking; repeated data partitioning; rule-based filter; simulated artificial noise; unweighted voting; Filters;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Reuse and Integration, Conf, 2005. IRI-2005 IEEE International Conference on.
Print_ISBN :
0-7803-9093-8
Type :
conf
DOI :
10.1109/IRI-05.2005.1506474
Filename :
1506474