DocumentCode
141002
Title
Data quality: The other face of Big Data
Author
Saha, Barna ; Srivastava, Divesh
Author_Institution
AT&T Labs.-Res., Florham Park, NJ, USA
fYear
2014
fDate
March 31 2014-April 4 2014
Firstpage
1294
Lastpage
1297
Abstract
In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences for the results of data analyses, the importance of veracity, the fourth 'V' of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three 'V's, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the “data speak for itself” in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
Keywords
Big Data; Internet; decision making; quality management; Web databases; big data quality management; data analysis; data variety; data velocity; data volume; data-driven decision making; source diversity; Cleaning; Data handling; Data storage systems; Databases; Information management; Maintenance engineering; Quality management;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Engineering (ICDE), 2014 IEEE 30th International Conference on
Conference_Location
Chicago, IL
Type
conf
DOI
10.1109/ICDE.2014.6816764
Filename
6816764