• DocumentCode
    55425
  • Title

    Category-Based Infidelity Bounded Queries over Unstructured Data Streams

  • Author

    Bhide, Manish ; Ramamritham, K.

  • Author_Institution
    IBM Software Lab., Hyderabad, India
  • Volume
    25
  • Issue
    11
  • fYear
    2013
  • fDate
    Nov. 2013
  • Firstpage
    2448
  • Lastpage
    2462
  • Abstract
    We present the Caicos system that supports continuous infidelity bounded queries over a data stream, where each data item (of the stream) belongs to multiple categories. Caicos is made up of four primitives: Keywords, Queries, Data items, and Categories. A Category is a virtual entity consisting of all those data items that belong to it. The membership of a data item to a category is decided by evaluating a Boolean predicate (associated with each category) over the data item. Each data item and query in turn are associated with multiple keywords. Given a keyword query, unlike conventional unstructured data querying techniques that return the top-(K) documents, Caicos returns the top-(K) categories with infidelity less than the user specified infidelity bound. Caicos is designed to continuously track the evolving information present in a highly dynamic data stream. It, hence, computes the relevance of a category to the continuous keyword query using the data items occurring in the stream in the recent past (i.e., within the current "window"). To efficiently provide up-to-date answers to the continuous queries, Caicos needs to maintain the required metadata accurately. This requires addressing two subproblems: 1) Identifying the "right" metadata that needs to be updated for providing accurate results and 2) updating the metadata in an efficient manner. We show that the problem of identifying the right metadata can be further broken down into two subparts. We model the first subpart as an inequality constrained minimization problem and propose an innovative iterative algorithm for the same. The second subpart requires us to build an efficient dynamic programming-based algorithm, which helps us to find the right metadata that needs to be updated. Updating the metadata on multiple processors is a scheduling problem whose complexity is exponential in the length of the input. An approximate multiprocessor scheduling algorithm is, hence, proposed. Experimental evaluation o- Caicos using real-world dynamic data shows that Caicos is able to provide fidelity close to 100 percent using 45 percent less resources than the techniques proposed in the literature. This ability of Caicos to work accurately and efficiently even in scenarios with high data arrival rates makes it suitable for data intensive application domains.
  • Keywords
    data handling; dynamic programming; iterative methods; meta data; query processing; Boolean predicate; Caicos system; approximate multiprocessor scheduling algorithm; categories; category based infidelity bounded queries; data items; dynamic data stream; dynamic programming based algorithm; inequality constrained minimization problem; iterative algorithm; keyword query; keywords; metadata; queries; unstructured data querying techniques; unstructured data streams; virtual entity; Blogs; Companies; Engines; Fuels; Heuristic algorithms; Program processors; Search engines; Continuous query; category-based query; data stream; threshold queries;
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2012.200
  • Filename
    6329890