Title :
Development of a Semi-synthetic Dataset as a Testbed for Big-Data Semantic Analytics
Author :
Techentin, Robert ; Foti, Dora ; Li, Peng ; Daniel, E. ; Gilbert, Barry ; Holmes, David ; Al-Saffar, Sinan
Author_Institution :
Mayo Clinic, Rochester, MN, USA
Abstract :
We have developed a large, semantically rich, semi-synthetic dataset modeled after the medical record of a large medical institution. Using the highly diverse data.gov data repository and a multivariate data augmentation strategy, we can generate arbitrarily large semi-synthetic datasets that can be used to test new algorithms and computational platforms. The construction process and basic data characterization are described. The databases, as well as code for data collection, consolidation, and augmentation, are available for distribution.
Keywords :
Big Data; data analysis; medical information systems; relational databases; very large databases; big-data semantic analytics; data augmentation; data collection; data consolidation; data.gov data repository; medical institution; medical record; multivariate data augmentation strategy; semisynthetic dataset development; Benchmark testing; Complexity theory; Conferences; Distributed databases; Resource description framework; Semantics; RDF; big data; data.gov; graph computing; semantic representation
Conference_Title :
Semantic Computing (ICSC), 2014 IEEE International Conference on
Conference_Location :
Newport Beach, CA
Print_ISBN :
978-1-4799-4002-8
DOI :
10.1109/ICSC.2014.45