DocumentCode :
3163784
Title :
A Masking Index for Quantifying Hidden Glitches
Author :
Berti-Equille, Laure ; Loh, Ji Meng ; Dasu, Tamraparni
Author_Institution :
IRD ESPACE DEV, Montpellier, France
fYear :
2013
fDate :
7-10 Dec. 2013
Firstpage :
21
Lastpage :
30
Abstract :
Data glitches are errors in a data set. They are complex entities that often span multiple attributes and records. When they co-occur in data, the presence of one type of glitch can hinder the detection of another type of glitch. This phenomenon is called masking. In this paper, we define two important types of masking, and we propose a novel, statistically rigorous indicator called the masking index for quantifying the hidden glitches in four cases of masking: outliers masked by missing values, outliers masked by duplicates, duplicates masked by missing values, and duplicates masked by outliers. The masking index is critical for data quality profiling and data exploration: it enables a user to measure the extent of masking and hence the confidence in the data. In this sense, it is a valuable data quality index for measuring the true cleanliness of the data. It is also an objective and quantitative basis for choosing the anomaly detection method best suited to the glitches present in a given data set. We demonstrate the utility and effectiveness of the masking index through extensive experiments on synthetic and real-world datasets.
Keywords :
data analysis; anomaly detection method; data exploration; data glitch detection; data quality index; data quality profiling; masking index; missing values; outliers; real-world datasets; statistical rigorous indicator; synthetic datasets; true cleanliness; Arrays; Cleaning; Data mining; Equations; Indexes; Robustness; Software; Anomaly detection; data cleaning; duplicate record identification; masking; missing values; outlier detection
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining (ICDM), 2013 IEEE 13th International Conference on
Conference_Location :
Dallas, TX
ISSN :
1550-4786
Type :
conf
DOI :
10.1109/ICDM.2013.16
Filename :
6729486