Title :
DSDQuery DSI - Querying scientific data repositories with structured operators
Author :
Roee Ebenstein;Gagan Agrawal
Author_Institution :
Department of Computer Science and Engineering, The Ohio State University
Abstract :
Scientific data is often distributed through repositories that host a large number of files in formats such as NetCDF or HDF5. With recent and anticipated increases in the size of observational and simulation data, it is important to transport only the data of interest from a large distributed dataset. Unfortunately, existing portals provide limited querying interfaces, typically a set of predefined, hard-coded subsetting options, which limits users' querying flexibility. This paper describes a system that addresses this gap. We adapt relational algebra to scientific array querying, which in turn lets us adapt a subset of SQL to this domain and apply nuanced subsetting conditions across a set of dataset files within a repository. A query processing algorithm extracts and collects data from the relevant datasets, based on metadata extracted beforehand by an automatic metadata extraction engine. Finally, the system stitches the collected data into a new structured NetCDF file that is returned as the result set, allowing the returned data to be used and analyzed by existing tools. The system has been extensively evaluated to show that it can handle increasing data sizes and increasing numbers of files.
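The pipeline outlined in the abstract (select relevant files from previously extracted metadata, subset each file, then stitch the pieces into one result) can be illustrated with a minimal sketch. This is not the DSDQuery implementation: the `FileMeta` record, the function names, the single `time` dimension, and the in-memory `storage` dictionary standing in for NetCDF files are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileMeta:
    # Hypothetical metadata record, of the kind a metadata
    # extraction engine might produce for each file in a repository.
    name: str
    variable: str
    time_min: int
    time_max: int


def relevant_files(catalog, variable, t_lo, t_hi):
    """Keep only files whose metadata shows they can contribute to the query."""
    return [
        m for m in catalog
        if m.variable == variable and m.time_max >= t_lo and m.time_min <= t_hi
    ]


def subset(meta, data, t_lo, t_hi):
    """Return (time, value) pairs from one file that fall within [t_lo, t_hi]."""
    times = range(meta.time_min, meta.time_max + 1)
    return [(t, v) for t, v in zip(times, data) if t_lo <= t <= t_hi]


def run_query(catalog, storage, variable, t_lo, t_hi):
    """Select files via metadata, subset each, and stitch one ordered result."""
    rows = []
    for meta in relevant_files(catalog, variable, t_lo, t_hi):
        rows.extend(subset(meta, storage[meta.name], t_lo, t_hi))
    return sorted(rows)


# Assumed toy repository: three files, two of which hold the queried variable.
catalog = [
    FileMeta("a.nc", "temp", 0, 4),
    FileMeta("b.nc", "temp", 5, 9),
    FileMeta("c.nc", "salinity", 0, 9),
]
storage = {
    "a.nc": [10.0, 11.0, 12.0, 13.0, 14.0],
    "b.nc": [15.0, 16.0, 17.0, 18.0, 19.0],
    "c.nc": [35.0] * 10,
}

# Roughly equivalent to: SELECT temp WHERE time BETWEEN 3 AND 6
result = run_query(catalog, storage, "temp", 3, 6)
```

In this sketch the metadata check alone rules out `c.nc`, so its data is never read; only the matching slices of `a.nc` and `b.nc` are collected and combined. A real system would write the stitched result to a NetCDF file rather than return a Python list.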
Keywords :
"Arrays","Algebra","Metadata","Distributed databases","Portals","Data mining"
Conference_Title :
Big Data (Big Data), 2015 IEEE International Conference on
DOI :
10.1109/BigData.2015.7363790