DocumentCode :
3077462
Title :
Parallel Clustering of High-Dimensional Social Media Data Streams
Author :
Xiaoming Gao ; Ferrara, Emilio ; Qiu, Judy
Author_Institution :
Sch. of Inf. & Comput., Indiana Univ., Bloomington, IN, USA
fYear :
2015
fDate :
4-7 May 2015
Firstpage :
323
Lastpage :
332
Abstract :
We introduce Cloud DIKW (Data, Information, Knowledge, Wisdom) as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative domain application: social media data stream clustering. In this context, recent work demonstrated that high-quality clusters can be generated by representing the data points using high-dimensional vectors that reflect textual content and social network information. However, due to the high cost of similarity computation, sequential implementations of even single-pass algorithms cannot keep up with the speed of real-world streams. This paper presents our efforts in meeting the constraints of realtimesocial media stream clustering through parallelization in Cloud DIKW. Specifically, we focus on two system-level issues. Firstly, most stream processing engines such as Apache Storm organize distributed workers in the form of a directed acyclic graph (DAG), which makes it difficult to dynamically synchronize the state of parallel clustering workers. We tackle this challenge by creating a separate synchronization channel using a pub-sub messaging system (ActiveMQ in our case). Secondly, due to the sparsity of the high-dimensional vectors, the size of cancroids grows quickly as new data points are assigned tithe clusters. As a result, traditional synchronization that directly broadcasts cluster cancroids becomes too expensive and limits the scalability of the parallel algorithm. We address this problem by communicating only dynamic changes of the clusters rather than the whole centred vectors. Our algorithm under Cloud DIKWcan process the Twitter 10% data stream ("gardenhose") in realtimewith 96-way parallelism. By natural improvements to CloudDIKW, including advanced collective communication techniques developed in our Harp project, we will be able to process the full Twitter data stream in real-time with 1000-way parallelism. Our use of powerful general sof- ware subsystems will enable many other applications that need integration of streaming and batch data analytics.
Keywords :
cloud computing; directed graphs; media streaming; message passing; middleware; parallel algorithms; pattern clustering; social networking (online); synchronisation; ActiveMQ; Cloud DIKW; DAG; data-information-knowledge-wisdom; directed acyclic graph; high-dimensional social media data streams; parallel algorithm; parallel batch processing; parallel clustering workers; pub-sub messaging system; social media data stream clustering; stream processing engines; synchronization channel; Algorithm design and analysis; Clustering algorithms; Heuristic algorithms; Media; Storms; Synchronization; Twitter; high-dimensional data; parallel algorithms; social media data stream clustering; stream processing engines; synchronization strategies;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location :
Shenzhen
Type :
conf
DOI :
10.1109/CCGrid.2015.19
Filename :
7152498
Link To Document :
بازگشت