Title :
Data streaming algorithms for the Kolmogorov-Smirnov test
Author_Institution :
Dept. of Math. &
Abstract :
We propose space-efficient algorithms for performing the Kolmogorov-Smirnov test on streaming data. The Kolmogorov-Smirnov test is a non-parametric test for measuring the strength of the hypothesis that a data set is drawn from a fixed distribution (one-sample test), or that two data sets are drawn from the same distribution (two-sample test). Unlike some other tests, Kolmogorov-Smirnov does not assume that the distribution has a known form (e.g., that it is normal), and in the two-sample case it requires no knowledge of the distribution other than that it is continuous. Motivated by the challenges of big data, we present algorithms for both the one-sample and the two-sample tests on data processed as a stream. We demonstrate the accuracy of our algorithms through extensive experiments on both real and synthetic datasets. We show that our algorithms are superior to sampling and that they perform the test accurately with a reduction in data of several orders of magnitude.
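For orientation, the statistic behind both variants of the test is the largest gap between two cumulative distribution functions: the empirical CDF against a reference CDF (one-sample), or two empirical CDFs against each other (two-sample). The sketch below computes that statistic exactly on data held fully in memory. It is not the streaming algorithm proposed in the paper, whose details the abstract does not give; the function names are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def ks_one_sample_statistic(xs, cdf):
    """Exact one-sample KS statistic: sup over x of |F_n(x) - F(x)|.

    `cdf` is an assumed callable returning the reference CDF value F(x).
    """
    if not xs:
        raise ValueError("sample must be non-empty")
    a = sorted(xs)
    n = len(a)
    d = 0.0
    for i, v in enumerate(a, start=1):
        f = cdf(v)
        # The empirical CDF jumps from (i-1)/n to i/n at v, so the gap
        # must be checked on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_two_sample_statistic(xs, ys):
    """Exact two-sample KS statistic: sup over x of |F_xs(x) - F_ys(x)|."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(xs), sorted(ys)
    n, m = len(a), len(b)
    d = 0.0
    # Both empirical CDFs are step functions that change only at observed
    # values, so it suffices to evaluate the gap at every data point.
    for v in a + b:
        f_a = bisect_right(a, v) / n
        f_b = bisect_right(b, v) / m
        d = max(d, abs(f_a - f_b))
    return d
```

Both functions sort and store the full input, which is the memory cost that a space-efficient streaming method would aim to avoid.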
Keywords :
"Extraterrestrial measurements","Distribution functions","Big data","Internet","Green products","Computational modeling","Standards"
Conference_Title :
Big Data (Big Data), 2015 IEEE International Conference on
DOI :
10.1109/BigData.2015.7363746