Title :
Towards eliminating random I/O in hash joins
Author :
Lo, Ming-Ling ; Ravishankar, Chinya V.
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Michigan Univ., Ann Arbor, MI, USA
Date :
26 Feb-1 Mar 1996
Abstract :
The widening performance gap between CPUs and disks has a significant impact on hash join performance. Most current hash join methods try to reduce the volume of data transferred between memory and disk. In this paper, we try to reduce hash-join times by reducing random I/O. We study how current algorithms incur random I/O, and propose a new hash join method, Seq+, that converts much of the random I/O to sequential I/O. Seq+ uses a new organization for hash buckets on disk, and larger input and output buffer sizes. We introduce the technique of batch writes to reduce the bucket-write cost, and the concepts of write- and read-groups of hash buckets to reduce the bucket-read cost. We derive a cost model for our method and present formulas for choosing various algorithm parameters, including input and output buffer sizes. Our performance study shows that the new hash join method performs many times better than current algorithms in various environments. Since our cost functions underestimate the cost of current algorithms and overestimate the cost of Seq+, the actual performance gain of Seq+ is likely to be even greater.
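The batch-write idea in the abstract — buffering tuples per hash bucket and flushing all bucket buffers together as one contiguous write, instead of issuing one random write per bucket — can be illustrated with a minimal sketch. This is not the paper's Seq+ algorithm; the function names, buffer sizes, and the use of an in-memory file to stand in for the disk are all illustrative assumptions.

```python
import io
from collections import defaultdict

NUM_BUCKETS = 4    # illustrative; the paper derives such parameters from a cost model
BUFFER_LIMIT = 3   # tuples buffered per bucket before a batch flush (illustrative)

def batch_flush(buffers, disk, layout):
    """Flush every non-empty bucket buffer contiguously: one long
    sequential write (a 'write-group') rather than many scattered
    random writes, one per bucket."""
    for b in sorted(buffers):
        data = "".join(f"{t}\n" for t in buffers[b])
        if data:
            # Remember where each bucket fragment landed so the
            # read phase can gather it back later.
            layout.append((b, disk.tell(), len(data)))
            disk.write(data)
        buffers[b].clear()

def hash_partition(tuples, disk):
    """Partition tuples into hash buckets, writing buckets to disk
    in batches instead of tuple-at-a-time."""
    buffers = defaultdict(list)
    layout = []   # (bucket_id, offset, length) records
    buffered = 0
    for t in tuples:
        buffers[hash(t) % NUM_BUCKETS].append(t)
        buffered += 1
        if buffered >= NUM_BUCKETS * BUFFER_LIMIT:
            batch_flush(buffers, disk, layout)
            buffered = 0
    batch_flush(buffers, disk, layout)  # flush the remainder
    return layout
```

A usage example: `disk = io.StringIO(); layout = hash_partition(list(range(10)), disk)` leaves the fragments of each bucket at the contiguous offsets recorded in `layout`, so the subsequent probe phase can read each bucket's fragments as a small number of sequential reads.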
Keywords :
buffer storage; file organisation; input-output programs; random processes; software performance evaluation; CPU performance; Seq+ hash join method; algorithm parameter selection; batch writes; bucket-read cost; bucket-write cost; buffer sizes; cost model; disk performance; hash bucket organization; hash join performance; random input/output elimination; read-groups; sequential I/O; write-groups; Costs; Geoscience; Spatial databases; Stability; Writing;
Conference_Titel :
Proceedings of the Twelfth International Conference on Data Engineering, 1996
Conference_Location :
New Orleans, LA
Print_ISBN :
0-8186-7240-4
DOI :
10.1109/ICDE.1996.492191