Abstract :
Data summarization is an effective approach to dealing with the "big data" problem. While data summarization problems traditionally have been studied is the streaming model, the focus is starting to shift to distributed models, as distributed/parallel computation seems to be the only viable way to handle today\´s massive data sets. In this paper, we study ε-approximations, a classical data summary that, intuitively speaking, preserves approximately the density of the underlying data set over a certain range space. We consider the problem of computing ε-approximations for a data set which is held jointly by k players, and give general communication upper and lower bounds that hold for any range space whose discrepancy is known.
Keywords :
Big Data; communication complexity; message passing; Big Data problem; communication complexity; data set density; data summarization problems; data summary; distributed ε-approximations; distributed computation; distributed models; general communication lower bounds; general communication upper bounds; massive data set handling; parallel computation; range space; Approximation methods; Complexity theory; Computational modeling; Data models; Distributed databases; Protocols; Standards; ε-approximations; communication complexity; discrepancy; distributed data;