Title :
WikiAnalytics: Ad-hoc querying of highly heterogeneous structured data
Author :
Balmin, Andrey ; Curtmola, Emiran
Author_Institution :
IBM Almaden Res. Center, San Jose, CA, USA
Abstract :
Searching and extracting meaningful information from highly heterogeneous datasets is a hot topic that has received a lot of attention. However, existing solutions fall at two extremes. At one extreme are rigid, complex query languages (e.g., SQL, XQuery/XPath), which are hard to use without full schema knowledge and an expert user, and which require up-front data integration. At the other extreme are keyword search queries over relational databases, as well as over semistructured data, which are too imprecise to specify the user's intent exactly. To address these limitations, we propose an alternative search paradigm that derives tables of precise and complete results from a very sparse set of heterogeneous records. Our approach allows users to disambiguate search results by navigating along conceptual dimensions that describe the records. To this end, we cluster documents based on the fields and values that contain the query keywords. We build a universal navigational lattice (UNL) over all such discovered clusters. Conceptually, the UNL encodes all possible ways to group the documents in the data corpus based on where the keywords hit. We describe WikiAnalytics, a system that facilitates data extraction from the Wikipedia infobox collection. WikiAnalytics provides a dynamic and intuitive interface that lets the average user explore the search results and construct homogeneous structured tables, which can be further queried and mashed up (e.g., filtered and aggregated) using conventional tools.
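The clustering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' UNL construction. It assumes records are flat field/value dictionaries, that a keyword "hits" a field when it occurs as a case-insensitive substring of the value, and that one cluster refines another when its set of hits is a strict superset. The function names and the sample records are hypothetical.

```python
from collections import defaultdict

def hit_signature(record, keywords):
    """Return the set of (keyword, field) pairs where a keyword occurs in a field value.

    Assumption: a "hit" is a case-insensitive substring match.
    """
    hits = set()
    for field, value in record.items():
        text = str(value).lower()
        for kw in keywords:
            if kw.lower() in text:
                hits.add((kw.lower(), field))
    return frozenset(hits)

def cluster_by_hits(records, keywords):
    """Group record indices by hit signature; records without any hit are dropped."""
    clusters = defaultdict(list)
    for i, rec in enumerate(records):
        sig = hit_signature(rec, keywords)
        if sig:
            clusters[sig].append(i)
    return dict(clusters)

def refinements(clusters, signature):
    """Return the cluster signatures that strictly extend `signature` (subset order)."""
    return [s for s in clusters if signature < s]

# Hypothetical infobox-like records with differing fields.
records = [
    {"name": "Paris", "country": "France"},
    {"name": "Paris Hilton", "occupation": "media personality"},
    {"name": "Eiffel Tower", "location": "Paris, France"},
]
clusters = cluster_by_hits(records, ["paris", "france"])
```

Each key of `clusters` records where the keywords hit, so a user could separate records named "Paris" from records located in Paris. Ordering the signatures by set inclusion, as `refinements` does, gives one simple lattice-like navigation structure over the clusters.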
Keywords :
query languages; relational databases; user interfaces; WikiAnalytics system; Wikipedia infobox collection; ad-hoc querying; data extraction; data integration; expert user; highly heterogeneous structured data; keyword search queries; schema knowledge; search results disambiguation; universal navigational lattice; Catalogs; Data mining; Database languages; HTML; Keyword search; Lattices; Navigation; Query processing; Wikipedia;
Conference_Title :
Data Engineering (ICDE), 2010 IEEE 26th International Conference on
Conference_Location :
Long Beach, CA
Print_ISBN :
978-1-4244-5445-7
Electronic_ISBN :
978-1-4244-5444-0
DOI :
10.1109/ICDE.2010.5447751