Title :
A Tree-Based Framework for Difference Summarization
Author :
Jin, Ruoming ; Breitbart, Yuri ; Li, Rong
Author_Institution :
Dept. of Comput. Sci., Kent State Univ., Kent, OH, USA
Abstract :
Understanding the differences between two datasets is a fundamental data mining question and is important across many real-world scientific applications. In this paper, we propose a tree-based framework that provides a parsimonious explanation of the difference between two distributions, based on a rigorous two-sample statistical test. We develop two efficient approaches. The first is a dynamic programming approach that finds a minimal number of data subsets that describe the difference between two datasets. The second is a greedy approach that approximates the dynamic programming approach. We employ the well-known Friedman's MST (minimal spanning tree) statistics for two-sample statistical tests in our summarization tree construction, and develop novel techniques to speed up their computation. We performed a detailed experimental evaluation on both real and synthetic datasets and demonstrated the effectiveness of our tree-summarization approach.
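The MST-based two-sample test named in the abstract can be illustrated with a minimal sketch. It assumes the standard Friedman-Rafsky construction: pool both samples, build a Euclidean minimal spanning tree over the pooled points, and count the edges that join points from different samples. The function name `cross_edge_count` and the plain O(n^2) Prim's algorithm are illustrative choices, not the authors' implementation. The sketch omits the paper's speed-up techniques and the normalization of the count into a test statistic.

```python
import math


def cross_edge_count(sample_a, sample_b):
    """Count MST edges that join a point of sample_a to a point of sample_b.

    Points are tuples of coordinates. Few cross-sample edges suggest the
    two samples occupy different regions; many suggest they are well mixed.
    """
    # Pool the samples, remembering which sample each point came from.
    points = [(p, 0) for p in sample_a] + [(p, 1) for p in sample_b]
    n = len(points)
    if n < 2:
        return 0

    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known edge into the tree
    best_from = [-1] * n         # tree endpoint of that edge
    best_dist[0] = 0.0
    cross = 0

    # Prim's algorithm on the complete Euclidean graph.
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        # The edge that attached u is an MST edge; check its endpoint labels.
        if best_from[u] >= 0 and points[u][1] != points[best_from[u]][1]:
            cross += 1
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u][0], points[v][0])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return cross
```

For two well-separated clusters the tree needs only one edge to bridge them, so the count is small. For samples that interleave, most tree edges cross between samples. In a full test, the count would be compared against its distribution under random relabeling of the pooled points.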
Keywords :
data mining; dynamic programming; statistical analysis; trees (mathematics); Friedman minimal spanning tree statistics; data mining; data subsets; datasets; difference summarization; dynamic programming approach; tree-based framework; two-sample statistical test; Application software; Computer science; Data mining; Drugs; Dynamic programming; Marketing and sales; Multidimensional systems; Statistical analysis; Testing; USA Councils; Chi-square test; Friedman-Rafsky test; Kolmogorov-Smirnov test; difference summarization; minimal spanning tree; two-sample test;
Conference_Title :
Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4244-5242-2
Electronic_ISBN :
1550-4786
DOI :
10.1109/ICDM.2009.68