Regional Information Center for Science and Technology - Record-aware compression for big textual data analysis acceleration

Abstract:

Big data analysis technologies are becoming more widely used in industry. The ever-increasing data volume, however, puts data analytics platforms such as Hadoop under constant pressure. Several compression methods have been made available on the Hadoop platform to reduce data size and deliver data efficiently between cluster nodes. In the Hadoop context, compressed data can be categorized as splittable or non-splittable. Working with non-splittable data conflicts with the goal of parallelism, because a file that cannot be split cannot be divided among parallel tasks. In addition, the current realization of splittable data by indexing is potentially harmful to the data locality property. To address both problems, we introduce the Record-aware Compression (RaC) scheme, which makes the compressed contents splittable, uses a lightweight Hadoop Record Reader, and preserves the parallelism and data locality properties as much as possible. We evaluate RaC using a set of classical MapReduce jobs with a collection of well-known datasets from companies such as Google, Yahoo!, and Amazon. The experimental results show an average 24% improvement in analysis performance and up to a 75% reduction in data size.
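
The general idea of splittable, record-aligned compression can be illustrated with a minimal sketch. This is not the RaC format, whose details the abstract does not give; it is a simplified stand-in that assumes newline-delimited records with no embedded newlines, uses Python's `zlib` in place of a Hadoop codec, and introduces the hypothetical names `compress_record_aligned`, `read_block`, and `target_block_bytes` purely for illustration. Each block holds only whole records and is compressed on its own, so a reader holding a single block can decode it without any other part of the file.

```python
import zlib
from typing import Iterable, Iterator, List


def compress_record_aligned(records: Iterable[bytes],
                            target_block_bytes: int = 64) -> List[bytes]:
    """Pack whole records into independently compressed blocks (illustrative only)."""
    blocks: List[bytes] = []
    current: List[bytes] = []
    current_size = 0
    for record in records:
        line = record + b"\n"  # assumption: records contain no newline themselves
        # Close the current block rather than let a record straddle its boundary.
        if current and current_size + len(line) > target_block_bytes:
            blocks.append(zlib.compress(b"".join(current)))
            current, current_size = [], 0
        current.append(line)
        current_size += len(line)
    if current:
        blocks.append(zlib.compress(b"".join(current)))
    return blocks


def read_block(block: bytes) -> Iterator[bytes]:
    """Decode one block in isolation, as a task assigned only that block would."""
    for line in zlib.decompress(block).splitlines():
        yield line


# Synthetic tab-separated records, made up for this example.
records = [f"user{i}\tclick\t{i * 7}".encode() for i in range(20)]
blocks = compress_record_aligned(records, target_block_bytes=64)
```

Because no record crosses a block boundary, blocks can be handed to separate tasks in any order, which is the property that non-splittable compressed files lack. A real Hadoop integration would additionally need block boundaries that a Record Reader can locate inside a file split; that part is outside the scope of this sketch.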