• DocumentCode
    1787463
  • Title

    Development of a Semi-synthetic Dataset as a Testbed for Big-Data Semantic Analytics

  • Author

    Techentin, Robert ; Foti, Dora ; Li, Peng ; Daniel, E. ; Gilbert, Barry ; Holmes, David ; Al-Saffar, Sinan

  • Author_Institution
    Mayo Clinic, Rochester, MN, USA
  • fYear
    2014
  • fDate
    16-18 June 2014
  • Firstpage
    252
  • Lastpage
    253
  • Abstract
    We have developed a large semi-synthetic, semantically rich dataset, modeled after the medical record of a large medical institution. Using the highly diverse data.gov data repository and a multivariate data augmentation strategy, we can generate arbitrarily large semi-synthetic datasets that can be used to test new algorithms and computational platforms. The construction process and basic data characterization are described. The databases, as well as code for data collection, consolidation, and augmentation, are available for distribution.
  • Keywords
    Big Data; data analysis; medical information systems; relational databases; very large databases; big-data semantic analytics; data augmentation; data collection; data consolidation; data.gov data repository; medical institution; medical record; multivariate data augmentation strategy; semisynthetic dataset development; Benchmark testing; Complexity theory; Conferences; Distributed databases; Resource description framework; Semantics; RDF; big data; data.gov; graph computing; semantic representation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Semantic Computing (ICSC), 2014 IEEE International Conference on
  • Conference_Location
    Newport Beach, CA
  • Print_ISBN
    978-1-4799-4002-8
  • Type

    conf

  • DOI
    10.1109/ICSC.2014.45
  • Filename
    6882033