Title :
Keyword Search over Dynamic Categorized Information
Author :
Bhide, Manish ; Chakaravarthy, Venkatesan T. ; Ramamritham, Krithi ; Roy, Prasan
Author_Institution :
IBM India Res. Lab., New Delhi
Date :
March 29 2009-April 2 2009
Abstract :
Consider an information repository whose content is categorized. A data item in the repository can belong to multiple categories, and new data is continuously added to the system. In this paper, we describe a system, CS*, which takes a keyword query and returns the top-K relevant categories. In contrast, traditional keyword search returns the top-K documents (i.e., data items) relevant to a user query. The need to dynamically categorize new data and also update the meta-data required for fast responses to user queries poses interesting challenges. The brute-force approach of updating the meta-data by comparing each new data item with all the categories is impractical due to (i) the large cost of finding the categories associated with a data item and (ii) the high arrival rate of new data items. We show that a sampling-based approach that provides statistical guarantees on the reported results is also impractical. We hence develop the CS* approach, whose effectiveness results from its ability to focus on a strategically chosen subset of categories on the one hand and a subset of new data on the other. Given a query, CS* finds the top-K categories with high accuracy even in time-constrained situations. An experimental evaluation of the CS* system using real-world data shows that it can easily achieve accuracy in excess of 90%, whereas other approaches demand at least 57% more resources (i.e., processing power) to provide similar results. Our experimental results also show that, contrary to expectations, when the arrival rate of data items doubles, CS* continues to provide high accuracy without a significant increase in resources, whereas other approaches require more than double the resources.
Keywords :
document handling; information retrieval; meta data; dynamic categorized information; information repository; keyword query; keyword search; meta-data; queries poses; time-constrained situations; top-K categories; top-K documents; Blogs; Costs; Data engineering; Data systems; Educational institutions; Keyword search; Sampling methods; Search engines; Stock markets; USA Councils; Dynamic Data; Keyword Search; categorized search;
Conference_Title :
Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4244-3422-0
ISSN :
1084-4627
DOI :
10.1109/ICDE.2009.91