Title :
Generating Synthetic Data to Match Data Mining Patterns
Author :
Eno, Josh ; Thompson, Craig W.
Author_Institution :
Univ. of Arkansas, Little Rock, AR
Abstract :
Synthetic data sets can be useful in a variety of situations, including repeatable regression testing and providing realistic - but not real - data to third parties for testing new software. Researchers, engineers, and software developers can test against a safe data set without affecting or even accessing the original data, insulating them from privacy and security concerns as well as letting them generate larger data sets than would be available using only real data. Practitioners use data mining technology to discover patterns in real data sets that aren't apparent at the outset. This article explores how to combine information derived from data mining applications with the descriptive ability of synthetic data generation software. Our goal is to demonstrate that at least some data mining techniques (in particular, a decision tree) can discover patterns that we can then inverse-map into synthetic data sets. These synthetic data sets can be of any size and will faithfully exhibit the same (decision tree) patterns. Our work builds on two technologies: the Synthetic Data Definition Language and the Predictive Model Markup Language.
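The inverse mapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and uses neither of the two languages the article builds on. The tree, its attribute names (age, income), the leaf weights, and the class probabilities are all hypothetical values chosen for illustration. The idea shown is only this: treat each leaf of a decision tree as a set of attribute ranges plus a class distribution, pick leaves in proportion to their share of the records, and draw attribute values that satisfy the leaf's path constraints.

```python
import random

# Hypothetical decision tree, flattened to its leaves (illustrative values only).
# Each leaf records the half-open integer ranges [lo, hi) implied by the splits
# on its path, its share of the original records, and its class distribution.
LEAVES = [
    {"constraints": {"age": (18, 30), "income": (0, 40000)},
     "weight": 0.3, "classes": {"no": 0.9, "yes": 0.1}},
    {"constraints": {"age": (18, 30), "income": (40000, 120000)},
     "weight": 0.2, "classes": {"no": 0.3, "yes": 0.7}},
    {"constraints": {"age": (30, 90), "income": (0, 120000)},
     "weight": 0.5, "classes": {"no": 0.4, "yes": 0.6}},
]


def matches(row, leaf):
    """Return True if every attribute in the row satisfies the leaf's ranges."""
    return all(lo <= row[attr] < hi for attr, (lo, hi) in leaf["constraints"].items())


def generate(n, seed=0):
    """Generate n synthetic rows whose leaf membership follows the tree."""
    rng = random.Random(seed)
    weights = [leaf["weight"] for leaf in LEAVES]
    rows = []
    for _ in range(n):
        # Pick a leaf in proportion to its share of the records.
        leaf = rng.choices(LEAVES, weights=weights, k=1)[0]
        # Draw each attribute uniformly from the range the path allows.
        row = {attr: rng.randrange(lo, hi)
               for attr, (lo, hi) in leaf["constraints"].items()}
        # Draw the class label from the leaf's class distribution.
        labels = list(leaf["classes"])
        probs = list(leaf["classes"].values())
        row["label"] = rng.choices(labels, weights=probs, k=1)[0]
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in generate(5, seed=1):
        print(row)
```

Because the number of rows is a parameter, the generated set can be of any size while each row still falls into exactly one leaf of the assumed tree. Uniform sampling inside each range is a simplification made for this sketch.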
Keywords :
data analysis; data mining; decision trees; pattern matching; Predictive Model Markup Language; Synthetic Data Definition Language; data mining pattern matching; decision tree; inverse map; pattern discovery; synthetic data generation software; Data engineering; Data mining; Data privacy; Data security; Decision trees; Information security; Insulation; Pattern matching; Software safety; Software testing; Synthetic data generation
Journal_Title :
Internet Computing, IEEE
DOI :
10.1109/MIC.2008.55