DocumentCode :
3078203
Title :
Optimize Parallel Data Access in Big Data Processing
Author :
Jiangling Yin ; Jun Wang
Author_Institution :
EECS Dept., Univ. of Central Florida, Orlando, FL, USA
fYear :
2015
fDate :
4-7 May 2015
Firstpage :
721
Lastpage :
724
Abstract :
In recent years, the Hadoop Distributed File System (HDFS) has been deployed as the bedrock for many parallel big data processing systems, such as graph processing systems, MPI-based parallel programs, and Scala/Java-based Spark frameworks, which can efficiently support iterative and interactive data analysis in memory. The first part of my dissertation mainly focuses on studying parallel data access on distributed file systems, e.g., HDFS. Since the distributed I/O resources and global data distribution are often not taken into consideration, the data requests from parallel processes/executors will unfortunately be served in a remote or imbalanced fashion on the storage servers. To address these problems, we develop I/O middleware systems and matching-based algorithms that map parallel data requests to storage servers such that local and balanced data access can be achieved. The last part of my dissertation presents our plans to improve the performance of interactive data access in big data analysis. Specifically, most interactive analysis programs scan through the entire data set regardless of which data is actually required. We plan to develop a content-aware method to quickly access the required data without this laborious scanning process.
Keywords :
Big Data; Java; data analysis; input-output programs; interactive systems; middleware; parallel processing; HDFS; Hadoop distributed file system; IO middleware systems; MPI-based parallel programs; balanced data access; big data analysis; big data processing; content-aware method; distributed IO resources; global data distribution; graph processing systems; interactive data analysis; iterative data analysis; laborious scanning process; local data access; matching-based algorithms; optimize parallel data access; parallel big data processing systems; parallel data requests; scala-java-based Spark frameworks; storage servers; Bandwidth; Big data; Data visualization; Distributed databases; Middleware; Simultaneous localization and mapping;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location :
Shenzhen
Type :
conf
DOI :
10.1109/CCGrid.2015.168
Filename :
7152541