Development of a Semi-synthetic Dataset as a Testbed for Big-Data Semantic Analytics

Author

Techentin, Robert; Foti, Dora; Li, Peng; Daniel, E.; Gilbert, Barry; Holmes, David; Al-Saffar, Sinan

Author_Institution

Mayo Clinic, Rochester, MN, USA

fYear

2014

fDate

16-18 June 2014

Firstpage

252

Lastpage

253

Abstract

We have developed a large, semantically rich, semi-synthetic dataset modeled after the medical records of a large medical institution. Using the highly diverse data.gov data repository and a multivariate data augmentation strategy, we can generate arbitrarily large semi-synthetic datasets that can be used to test new algorithms and computational platforms. The construction process and basic data characterization are described. The databases, as well as the code for data collection, consolidation, and augmentation, are available for distribution.

Keywords

Big Data; data analysis; medical information systems; relational databases; very large databases; big-data semantic analytics; data augmentation; data collection; data consolidation; data.gov data repository; medical institution; medical record; multivariate data augmentation strategy; semisynthetic dataset development; Benchmark testing; Complexity theory; Conferences; Distributed databases; Resource description framework; Semantics; RDF; big data; data.gov; graph computing; semantic representation

fLanguage

English

Publisher

IEEE

Conference_Titel

Semantic Computing (ICSC), 2014 IEEE International Conference on

Conference_Location

Newport Beach, CA

Print_ISBN

978-1-4799-4002-8

Type

conf

DOI

10.1109/ICSC.2014.45

Filename

6882033