DocumentCode :
168724
Title :
Supporting Queries and Analyses of Large-Scale Social Media Data with Customizable and Scalable Indexing Techniques over NoSQL Databases
Author :
Xiaoming Gao ; Qiu, Jian
Author_Institution :
Sch. of Inf. & Comput., Indiana Univ., Bloomington, IN, USA
fYear :
2014
fDate :
26-29 May 2014
Firstpage :
587
Lastpage :
590
Abstract :
Social media data analysis demonstrates two special characteristics in Big Data processing. First, most analyses focus on data subsets related to specific social events or activities instead of the whole dataset. Second, analysis workflows consist of multiple stages, and algorithms applied in each stage may use different computation and communication patterns depending on processing frameworks. This paper presents our efforts in supporting the data storage and processing requirements for such characteristics. To achieve efficient queries about target data subsets, we propose a general customizable and scalable indexing framework that can be built over distributed NoSQL databases. This framework allows users to define suitable customized index structures for their query patterns against social media data, and supports scalable indexing of both historical and streaming data. We implement this framework on HBase, and name it IndexedHBase. Starting from IndexedHBase, we build a distributed analysis stack based on YARN to support analysis algorithms using different processing frameworks, such as Hadoop MapReduce, Harp, and Giraph. This analysis stack is used to host the Truthy social media data observatory, and we have applied the customized index structures in supporting both query evaluation and sophisticated analysis algorithms. Performance tests show that our solutions outperform implementations using both direct raw data scans and current indexing mechanisms in existing NoSQL databases.
Keywords :
Big Data; SQL; data analysis; indexing; query processing; social networking (online); storage management; Big Data processing; Giraph; Hadoop MapReduce; Harp; Truthy social media data observatory; YARN; analysis workflows; communication patterns; computation patterns; customizable indexing framework; customizable indexing techniques; customized index structures; data storage; data subsets; distributed NoSQL databases; distributed analysis stack; historical data indexing; IndexedHBase; large-scale social media data; processing requirements; query evaluation; query patterns; scalable indexing framework; scalable indexing techniques; social activities; social events; social media data analysis; streaming data indexing; Algorithm design and analysis; Data analysis; Distributed databases; Indexing; Media; NoSQL databases; customizable and scalable indexing
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location :
Chicago, IL
Type :
conf
DOI :
10.1109/CCGrid.2014.57
Filename :
6846507