مرکز منطقه ای اطلاع رساني علوم و فناوري - Traversal optimizations and analysis for large graph clustering

Abstract :

Graph is an extremely versatile data structure in terms of its expressiveness and flexibility to model a range of real life phenomenon, such as social, biological, sensor, and computer networks. Finding groups of vertices based on their similarity is the fundamental graph mining task to get useful insights. The existing methods suffer from scalability issues due to enormous computations of an exact similarity estimation. Therefore, we introduce Collaborative Similarity Measure (CSM) based on shortest path strategy, instead of all paths, to define structural and semantic relevance among vertices efficiently. We evaluate this measure for personalized email community detection as an application scenario. However, an abundance of structural information has resulted in non-trivial graph traversals. Shortcut construction is among the utilized techniques implemented for efficient shortest path (SP) traversals on graphs. The shortcut construction, being a computationally intensive task, required to be exclusive and offline, often produces unnecessary auxiliary data. To overcome this issue, we present Shortest Path Overlapped Region (SPORE), a performance-based initiative that improves the shortcut construction performance by exploiting SP overlapped regions. Path overlapping with empirical analysis has been overlooked by shortcut construction systems. SPORE avails this opportunity and provides a solution by constructing auxiliary shortcuts incrementally, using SP trees during traversals, instead of an exclusive step. SPORE is exposed to a graph clustering task, which requires extensive graph traversals to group similar vertices together, for realistic implications. We further suggest an optimization strategy to accelerate the performance of the clustering process using confined subgraph traversals. Leveraging the SPORE with multiple SP computations consistently reduces the latency of the entire clustering process. A parameter-free graph clustering with scalable graph trave- sal strategy for a billion scale graph remain an open issue.

Keywords :

data mining; electronic mail; graph theory; optimisation; pattern clustering; CSM; SP trees; SPORE; collaborative similarity measure; confined subgraph traversals; graph mining task; nontrivial graph traversals; parameter-free graph clustering; personalized e-mail community detection; shortest path graph traversals; shortest path overlapped region; traversal optimizations; versatile data structure; Clustering algorithms; Collaboration; Communities; Data mining; Electronic mail; Semantics; Social network services;