• DocumentCode
    652142
  • Title

    Perturbed Gibbs Samplers for Generating Large-Scale Privacy-Safe Synthetic Health Data

  • Author

    Park, Yubin ; Ghosh, Joydeep ; Shankar, M.

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Texas at Austin, Austin, TX, USA
  • fYear
    2013
  • fDate
    9-11 Sept. 2013
  • Firstpage
    493
  • Lastpage
    498
  • Abstract
    This paper introduces a non-parametric data synthesizing algorithm to generate privacy-safe "realistic but not real" synthetic health data. Our goal is to provide a systematic mechanism that guarantees an adequate and controllable level of privacy while substantially improving on the utility of public use data, compared to current practices by CMS, OSHPD and other agencies. The proposed algorithm synthesizes artificial records while preserving the statistical characteristics of the original data to the extent possible. The risk from "database linking attack" is quantified by either an l-diversified or an ϵ-differentially perturbed data generation process. Moreover its algorithmic performance is optimized using Locality-Sensitive Hashing and parallel computation techniques to yield a linear-time algorithm that is suitable for Big Data Health applications. We synthesize a public Medicare claim dataset using the proposed algorithm, and demonstrate multiple data mining applications and statistical analyses using the data. The synthetic dataset delivers results that are substantially identical to those obtained from the original dataset, without revealing the actual records.
  • Keywords
    cryptography; data mining; data privacy; health care; medical information systems; statistical analysis; Gibbs sampler; big data health application; data mining; database linking attack; large-scale privacy-safe synthetic health data; linear-time algorithm; locality-sensitive hashing; nonparametric data synthesizing algorithm; parallel computation technique; public medicare claim dataset; statistical characteristic; Data privacy; Joining processes; Markov processes; Measurement; Medical services; Privacy; Synthesizers; Gibbs Sampler; Healthcare; Non-parametric; Privacy; Synthetic Data;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Healthcare Informatics (ICHI), 2013 IEEE International Conference on
  • Conference_Location
    Philadelphia, PA
  • Type

    conf

  • DOI
    10.1109/ICHI.2013.76
  • Filename
    6680524