Fast scalable selection algorithms for large scale data

Author

Thompson, Lee Parnell ; Weijia Xu ; Miranker, Daniel P.

Author_Institution

Dept. of Comput. Sci., Univ. of Texas at Austin, Austin, TX, USA

fYear

2013

fDate

6-9 Oct. 2013

Firstpage

412

Lastpage

420

Abstract

Selection finding, and its most common form median finding, are used as a measure of central tendency for problems in biology, databases, and graphics. These problems often require selection finding as a subcomponent where it can be called many times, and as such speed is important. The Map/Reduce framework has been shown to be an important tool for creating scalable applications. There are a number of valid implementations of the selection algorithms inside of a Map/Reduce framework, certain of which are compared in this paper. However, as the volume of data increases, subtle theoretical algorithmic implementation differences can lead to significant differences in practical application. Therefore, an efficient and scalable selection finding method has the potential to provide general benefit to a number of applications. This paper compares algorithms that have been redesigned or created for the Map/Reduce framework for the purpose of selection finding, or, finding the k-th ranked element in an unordered set. This paper takes the concepts used from two existing selection algorithms and translates them into a novel method using the Map/Reduce framework with two variations. Each approach uses a different methodology to reduce the total amount of workload needed for a selection. All the algorithms are compared together for scalability and efficiency in a computing cluster environment with up to 256 processing cores. The results show that the methods proposed in this paper outperform several common alternatives in identifying medians with Hadoop, including using sorting, Pig, and BinMedian methods. Our implementations are also available upon request.

Keywords

Big Data; distributed processing; public domain software; BinMedian method; Hadoop; MapReduce framework; Pig method; computing cluster environment; fast scalable selection algorithms; large scale data; sorting method; Algorithm design and analysis; Clustering algorithms; Computer science; Educational institutions; Libraries; Partitioning algorithms; Sorting; Hadoop; Map Reduce; Median Finding; Selection Algorithms;

fLanguage

English

Publisher

ieee

Conference_Titel

Big Data, 2013 IEEE International Conference on

Conference_Location

Silicon Valley, CA

Type

conf

DOI

10.1109/BigData.2013.6691602

Filename

6691602