• DocumentCode
    2079421
  • Title
    WikiAnalytics: Ad-hoc querying of highly heterogeneous structured data
  • Author
    Balmin, Andrey ; Curtmola, Emiran
  • Author_Institution
    IBM Almaden Res. Center, San Jose, CA, USA
  • fYear
    2010
  • fDate
    1-6 March 2010
  • Firstpage
    1145
  • Lastpage
    1148
  • Abstract
    Searching and extracting meaningful information from highly heterogeneous datasets is a topic that has received a lot of attention. However, existing solutions sit at two extremes. At one extreme are rigid, complex query languages (e.g., SQL, XQuery/XPath), which are hard to use without full schema knowledge and an expert user, and which require up-front data integration. At the other extreme are keyword search queries over relational databases and semistructured data, which are too imprecise to specify the user's intent exactly. To address these limitations, we propose an alternative search paradigm that derives tables of precise and complete results from a very sparse set of heterogeneous records. Our approach allows users to disambiguate search results by navigating along conceptual dimensions that describe the records. To this end, we cluster documents based on the fields and values that contain the query keywords. We build a universal navigational lattice (UNL) over all such discovered clusters. Conceptually, the UNL encodes all possible ways to group the documents in the data corpus based on where the keywords hit. We describe WikiAnalytics, a system that facilitates data extraction from the Wikipedia infobox collection. WikiAnalytics provides a dynamic and intuitive interface that lets the average user explore the search results and construct homogeneous structured tables, which can be further queried and mashed up (e.g., filtered and aggregated) using conventional tools.
  • Keywords
    query languages; relational databases; user interfaces; WikiAnalytics system; Wikipedia infobox collection; ad-hoc querying; data extraction; data integration; expert user; highly heterogeneous structured data; keyword search queries; schema knowledge; search results disambiguation; universal navigational lattice; user interface; Catalogs; Data mining; Database languages; HTML; Keyword search; Lattices; Navigation; Query processing; Relational databases; Wikipedia;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2010 IEEE 26th International Conference on
  • Conference_Location
    Long Beach, CA
  • Print_ISBN
    978-1-4244-5445-7
  • Electronic_ISBN
    978-1-4244-5444-0
  • Type
    conf
  • DOI
    10.1109/ICDE.2010.5447751
  • Filename
    5447751
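
The abstract describes clustering documents by the fields and values in which the query keywords hit, then organizing those clusters into a lattice. The following is a minimal Python sketch of that idea only, not the authors' implementation: the function names (`hit_signature`, `cluster_by_hits`, `lattice_edges`), the substring-matching rule, and the use of plain subset inclusion to order clusters are all assumptions made for illustration.

```python
from collections import defaultdict


def hit_signature(record, keywords):
    """Return the set of (keyword, field) pairs where a keyword occurs
    in the field name or the field value (assumed matching rule:
    case-insensitive substring)."""
    signature = set()
    for field, value in record.items():
        text = f"{field} {value}".lower()
        for keyword in keywords:
            if keyword.lower() in text:
                signature.add((keyword.lower(), field))
    return frozenset(signature)


def cluster_by_hits(records, keywords):
    """Group record ids by their hit signature; records with no hit are dropped."""
    clusters = defaultdict(list)
    for record_id, record in records.items():
        signature = hit_signature(record, keywords)
        if signature:
            clusters[signature].append(record_id)
    return dict(clusters)


def lattice_edges(clusters):
    """Order clusters by strict subset inclusion of their signatures.
    This yields a partial order over clusters (no transitive reduction)."""
    signatures = list(clusters)
    return [(a, b) for a in signatures for b in signatures if a < b]


# Hypothetical infobox-like records, invented for this example.
records = {
    "Paris": {"type": "city", "country": "France"},
    "Lyon": {"type": "city", "country": "France"},
    "Air France": {"name": "Air France", "type": "airline"},
    "Springfield": {"type": "city"},
}

clusters = cluster_by_hits(records, ["france", "city"])
for signature, ids in clusters.items():
    print(sorted(signature), ids)
```

Each cluster groups records in which the keywords land in the same fields, so a user could pick the cluster where "france" hits `country` rather than `name` and treat it as a homogeneous table.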