Regional Information Center for Science and Technology - Applicability of Regression-Tree-Based Synthetic Data Methods for Business Data

DocumentCode :

3128953

Title :

Applicability of Regression-Tree-Based Synthetic Data Methods for Business Data

Author :

Lee, Joo-Ho ; Kim, In-Yong ; O'Keefe, Christine M.

Author_Institution :

Dept. of Stat., Korea Univ., Seoul, South Korea

fYear :

2011

fDate :

11 Dec. 2011

Firstpage :

651

Lastpage :

658

Abstract :

This paper concerns the use of synthetic data for protecting the confidentiality of business data during statistical analysis. Synthetic datasets are constructed by replacing sensitive values in a confidential dataset with draws from statistical models estimated on the confidential dataset. Unfortunately, the process of generating effective statistical models can be a difficult and labour-intensive task. Recently, it has been proposed to use easily-implemented methods from machine learning instead of statistical model estimation in the data synthesis task. J. Drechsler and J.P. Reiter [1] have conducted an evaluation of four such methods, and have found that regression trees could give rise to synthetic datasets which provide reliable analysis results as well as low disclosure risks. Their conclusion was based on simulations using a subset of the 2002 Uganda census public use file, and it is an interesting question whether the same conclusion applies to other types of data with different characteristics. For example, business data have quite different characteristics from population census and survey data. Business data generally have few variables that are mostly categorical, and often have highly skewed distributions with outliers. In this paper we investigate the applicability of regression-tree-based methods for constructing synthetic business data. We give a detailed example comparing exploratory data analysis and linear regression results under two variants of a regression-tree-based synthetic data approach. We also include an evaluation of the analysis results with respect to the results of analysis of the original data. We further investigate the impact of different stopping criteria on performance. Our example provides evidence that synthesisers based on regression trees may not be immediately applicable in the context of business data. Further investigation, including further simulation studies with larger datasets, is certainly indicated.
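The synthesis step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and not necessarily either of the two variants they compare: it assumes a basic regression-tree synthesiser that grows a tree on the confidential data, routes each record to its leaf, and replaces the sensitive value with a random draw from the values observed in that leaf. The function names (`fit_tree`, `find_leaf`, `synthesise`), the `min_leaf` stopping criterion, and the toy data are all hypothetical and chosen only for illustration.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (node impurity)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_tree(rows, ys, min_leaf=5):
    """Grow a regression tree; stop when no split leaves min_leaf records per side."""
    n = len(ys)
    best = None  # (cost, feature index, threshold)
    if n >= 2 * min_leaf:
        for j in range(len(rows[0])):
            order = sorted(range(n), key=lambda i: rows[i][j])
            for k in range(min_leaf, n - min_leaf + 1):
                lo, hi = rows[order[k - 1]][j], rows[order[k]][j]
                if lo == hi:
                    continue  # cannot split between tied predictor values
                cost = (sse([ys[i] for i in order[:k]])
                        + sse([ys[i] for i in order[k:]]))
                if best is None or cost < best[0]:
                    best = (cost, j, lo)
    if best is None:
        return {"leaf": list(ys)}  # leaf keeps its observed (donor) values
    _, j, t = best
    left = [i for i in range(n) if rows[i][j] <= t]
    right = [i for i in range(n) if rows[i][j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": fit_tree([rows[i] for i in left], [ys[i] for i in left], min_leaf),
        "right": fit_tree([rows[i] for i in right], [ys[i] for i in right], min_leaf),
    }

def find_leaf(tree, row):
    """Route one record down the tree and return the donor values in its leaf."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

def synthesise(tree, rows, rng):
    """Replace each sensitive value with a random draw from its leaf's donors."""
    return [rng.choice(find_leaf(tree, row)) for row in rows]

# Toy data (hypothetical): one predictor and a right-skewed sensitive variable.
rows = [[x] for x in range(40)]
ys = [x ** 2 for x in range(40)]

tree = fit_tree(rows, ys, min_leaf=5)
synthetic = synthesise(tree, rows, random.Random(0))
```

In this sketch a larger `min_leaf` gives bigger leaves, so each synthetic value is drawn from a wider pool of donors; that tends to lower disclosure risk at the cost of analytical accuracy, which is the kind of trade-off the stopping-criteria comparison in the abstract concerns. Because draws are taken only from observed values, an outlier in a small leaf can be reproduced exactly, which is one reason skewed business data may be harder to protect this way.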

Keywords :

business data processing; regression analysis; security of data; trees (mathematics); 2002 Uganda census public use file; business data; confidentiality protection; data synthesis task; exploratory data analysis; linear regression; machine learning; regression-tree-based synthetic data methods; statistical analysis; Business; Data analysis; Data models; Histograms; Regression tree analysis; Sugar; Sugar industry; Business Data; Confidentiality; Disclosure; Imputation;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on

Conference_Location :

Vancouver, BC

Print_ISBN :

978-1-4673-0005-6

Type :

conf

DOI :

10.1109/ICDMW.2011.32

Filename :

6137442

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3128953