Author :
Matsubara, Yasuko ; Sakurai, Yasushi ; Yoshikawa, Masatoshi
Abstract :
Distribution data naturally arise in countless domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find similar clouds (i.e., distributions), to discover patterns, rules and outlier clouds. For example, consider the numerical case of sales of items, where, for each item sold, we record the unit price and quantity; then, each customer is represented as a distribution of 2d points (one for each item he/she bought). We want to find similar users, e.g., for market segmentation, anomaly/fraud detection. We propose to address this problem and present D-search, which includes fast and effective algorithms for similarity search in large distribution datasets. Our main contributions are (1) approximate KL divergence, which can speed up cloud-similarity computations, (2) multi-step sequential scan, which efficiently prunes a significant number of search candidates and leads to a direct reduction in the search cost. We also introduce an extended version of D-search: (3) time-series distribution mining, which finds similar subsequences in time-series distribution datasets. Extensive experiments on real multi-dimensional datasets show that our solution achieves up to 2,300 faster wall-clock time over the naive implementation while it does not sacrifice accuracy.
Keywords :
data mining; query processing; time series; very large databases; D-search; KL divergence; cloud similarity; data mining; distribution search; large distribution dataset; multidimensional dataset; multistep sequential scan; pattern discovery; query distribution; similarity search; time-series distribution mining; Biomedical imaging; Clouds; Data mining; Electroencephalography; Geology; Industrial economics; Kinetic energy; Meteorology; Multimedia databases; Power generation economics; Distribution sets; KL divergence; Singular value decomposition;