• DocumentCode
    1761398
  • Title

    Are Data Sets Like Documents?: Evaluating Similarity-Based Ranked Search over Scientific Data

  • Author

    Megler, V.M. ; Maier, David

  • Author_Institution
    Dept. of Comput. Sci., Portland State Univ., Portland, OR, USA
  • Volume
    27
  • Issue
    1
  • fYear
    2015
  • fDate
    Jan. 1 2015
  • Firstpage
    32
  • Lastpage
    45
  • Abstract
    The past decade has seen a dramatic increase in the amount of data captured and made available to scientists for research. This increase amplifies the difficulty scientists face in finding the data most relevant to their information needs. In prior work, we hypothesized that Information Retrieval-style ranked search can be applied to data sets to help a scientist discover the most relevant data amongst the thousands of data sets in many formats, much like text-based ranked search helps users make sense of the vast number of Internet documents. To test this hypothesis, we explored the use of ranked search for scientific data using an existing multi-terabyte observational archive as our test-bed. In this paper, we investigate whether the concept of varying relevance, and therefore ranked search, applies to numeric data: that is, are data sets enough like documents for Information Retrieval techniques and evaluation measures to apply? We present a user study that demonstrates that data set similarity resonates with users as a basis for relevance and, therefore, for ranked search. We evaluate a prototype implementation of ranked search over data sets with a second user study and demonstrate that ranked search improves a scientist's ability to find needed data.
  • Keywords
    Internet; information needs; information retrieval; natural sciences computing; text analysis; Internet documents; data sets; information needs; information retrieval-style ranked search; multiterabyte observational archive; numeric data; scientific data; similarity-based ranked search evaluation; text-based ranked search; Catalogs; Geospatial analysis; Ocean temperature; Search problems; Sociology; Statistics; Temperature distribution; Scientific databases; information retrieval and relevance; similarity search;
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2014.2320737
  • Filename
    6807734