• DocumentCode
    11363
  • Title
    Searching Dimension Incomplete Databases
  • Author
    Wei Cheng ; Xiaoming Jin ; Jian-Tao Sun ; Xuemin Lin ; Xiang Zhang ; Wei Wang
  • Author_Institution
    Dept. of Comput. Sci., Univ. of North Carolina at Chapel Hill, Carrboro, NC, USA
  • Volume
    26
  • Issue
    3
  • fYear
    2014
  • fDate
    Mar-14
  • Firstpage
    725
  • Lastpage
    738
  • Abstract
    Similarity query is a fundamental problem in database, data mining, and information retrieval research. Recently, querying incomplete data has attracted extensive attention because it poses new challenges to traditional querying techniques. Existing work on querying incomplete data addresses the case where the data values on certain dimensions are unknown. However, in many real-life applications, such as data collected by a sensor network in a noisy environment, not only the data values but also the dimension information may be missing. In this work, we investigate the problem of similarity search on dimension incomplete data. A probabilistic framework is developed to model this problem so that users can find objects in the database that are similar to the query with a probability guarantee. Missing dimension information poses a great computational challenge, since all possible combinations of missing dimensions need to be examined when evaluating the similarity between the query and the data objects. We develop lower and upper bounds on the probability that a data object is similar to the query. These bounds enable efficient filtering of irrelevant data objects without explicitly examining all missing dimension combinations. A probability triangle inequality is also employed to further prune the search space and speed up the query process. The proposed probabilistic framework and techniques can be applied to both whole and subsequence queries. Extensive experimental results on real-life data sets demonstrate the effectiveness and efficiency of our approach.
  • Keywords
    data mining; database management systems; probability; query processing; dimension incomplete databases; efficient filtering; information retrieval research; probabilistic framework; probability triangle inequality; search space; similarity query; similarity search; Educational institutions; Probabilistic logic; Query processing; Random variables; Time series analysis; Upper bound; Dimension incomplete database; whole sequence query
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2013.14
  • Filename
    6412668