DocumentCode :
3322849
Title :
Self-Join Size Estimation in Large-scale Distributed Data Systems
Author :
Pitoura, Theoni ; Triantafillou, P.
Author_Institution :
Inst. of Res. Acad. Comput. Technol., Patras Univ., Patras
fYear :
2008
fDate :
7-12 April 2008
Firstpage :
764
Lastpage :
773
Abstract :
In this work we tackle the open problem of self-join size (SJS) estimation in a large-scale distributed data system, where tuples of a relation are distributed over data nodes which comprise an overlay network. Our contributions include adaptations of five well-known SJS estimation centralized techniques (coined sequential, cross-sampling, adaptive, bifocal, and sample-count) to the network environment and a novel technique which is based on the use of the Gini coefficient. We develop analyses showing how Gini estimations can lead to estimations of the underlying Zipfian or power-law value distributions. We further contribute distributed sampling algorithms that can estimate accurately and efficiently the Gini coefficient. Finally, we provide detailed experimental evidence testifying for the claimed increased accuracy, precision, and efficiency of the proposed SJS estimation method, compared to the other methods. The proposed approach is the only one to ensure high efficiency, precision, and accuracy regardless of the skew of the underlying data.
Keywords :
distributed databases; peer-to-peer computing; sampling methods; very large databases; Gini coefficient; distributed sampling algorithms; large-scale distributed data systems; network environment; overlay network; power-law value distributions; self-join size estimation; Data mining; Data systems; Distributed computing; Frequency estimation; Information retrieval; Large-scale systems; Query processing; Sampling methods; Testing; Web pages;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on
Conference_Location :
Cancun
Print_ISBN :
978-1-4244-1836-7
Electronic_ISBN :
978-1-4244-1837-4
Type :
conf
DOI :
10.1109/ICDE.2008.4497485
Filename :
4497485
Link To Document :
بازگشت