DocumentCode
659453
Title
Fast scalable selection algorithms for large scale data
Author
Thompson, Lee Parnell ; Weijia Xu ; Miranker, Daniel P.
Author_Institution
Dept. of Comput. Sci., Univ. of Texas at Austin, Austin, TX, USA
fYear
2013
fDate
6-9 Oct. 2013
Firstpage
412
Lastpage
420
Abstract
Selection finding, and its most common form median finding, are used as a measure of central tendency for problems in biology, databases, and graphics. These problems often require selection finding as a subcomponent where it can be called many times, and as such speed is important. The Map/Reduce framework has been shown to be an important tool for creating scalable applications. There are a number of valid implementations of the selection algorithms inside of a Map/Reduce framework, certain of which are compared in this paper. However, as the volume of data increases, subtle theoretical algorithmic implementation differences can lead to significant differences in practical application. Therefore, an efficient and scalable selection finding method has the potential to provide general benefit to a number of applications. This paper compares algorithms that have been redesigned or created for the Map/Reduce framework for the purpose of selection finding, or, finding the k-th ranked element in an unordered set. This paper takes the concepts used from two existing selection algorithms and translates them into a novel method using the Map/Reduce framework with two variations. Each approach uses a different methodology to reduce the total amount of workload needed for a selection. All the algorithms are compared together for scalability and efficiency in a computing cluster environment with up to 256 processing cores. The results show that the methods proposed in this paper outperform several common alternatives in identifying medians with Hadoop, including using sorting, Pig, and BinMedian methods. Our implementations are also available upon request.
Keywords
Big Data; distributed processing; public domain software; BinMedian method; Hadoop; MapReduce framework; Pig method; computing cluster environment; fast scalable selection algorithms; large scale data; sorting method; Algorithm design and analysis; Clustering algorithms; Computer science; Educational institutions; Libraries; Partitioning algorithms; Sorting; Hadoop; Map Reduce; Median Finding; Selection Algorithms;
fLanguage
English
Publisher
ieee
Conference_Titel
Big Data, 2013 IEEE International Conference on
Conference_Location
Silicon Valley, CA
Type
conf
DOI
10.1109/BigData.2013.6691602
Filename
6691602
Link To Document