• DocumentCode
    3334308
  • Title
    Quarrying dataspaces: Schemaless profiling of unfamiliar information sources
  • Author
    Howe, Bill; Maier, David; Rayner, Nicolas; Rucker, James
  • Author_Institution
    Center for Coastal Margin Observation & Prediction, Oregon Health & Science University, Beaverton, OR
  • fYear
    2008
  • fDate
    7-12 April 2008
  • Firstpage
    270
  • Lastpage
    277
  • Abstract
    Traditional data integration and analysis approaches tend to assume intimate familiarity with the structure, semantics, and capabilities of the available information sources before applicable tools can be used effectively. This assumption often does not hold in practice. We introduce dataspace profiling as the cardinal activity when beginning a project in an unfamiliar dataspace. Dataspace profiling is an analysis of the structures and properties exposed by an information source, allowing 1) assessment of the utility and importance of the information source as a whole, 2) assessment of compatibility with the services of a dataspace support platform, and 3) determination and externalization of structure in preparation for specific data applications. In this paper, we define dataspace profiling and articulate requirements for dataspace profilers. We then describe the Quarry system, which offers a generic browse-and-query interface to support dataspace profiling activities, including path profiling, over a variety of data sources with minimal setup costs and minimal a priori assumptions. We show that the mechanisms used in Quarry deliver strong performance in large-scale applications. Specifically, we use Quarry to efficiently profile 1) a detailed standard for medication nomenclature supplied under a generic schema and 2) the metadata for an environmental observation and forecasting system, and conclude that in these contexts Quarry offers advantages over existing tools.
  • Keywords
    meta data; query processing; data analysis approaches; data integration approaches; dataspace quarrying; medication nomenclature standard; metadata; quarry system; schemaless dataspace profiling; unfamiliar information sources; Computer science; Costs; Data analysis; Data mining; Information analysis; Large-scale systems; Relational databases; Resource description framework; Sea measurements; Unified modeling language
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    IEEE 24th International Conference on Data Engineering Workshop, 2008 (ICDEW 2008)
  • Conference_Location
    Cancun
  • Print_ISBN
    978-1-4244-2161-9
  • Electronic_ISBN
    978-1-4244-2162-6
  • Type
    conf
  • DOI
    10.1109/ICDEW.2008.4498331
  • Filename
    4498331