On dimensionality reduction of massive graphs for indexing and retrieval

Author

Aggarwal, Charu C. ; Wang, Haixun

Author_Institution

IBM T. J. Watson Res. Center, Hawthorne, NY, USA

fYear

2011

fDate

11-16 April 2011

Firstpage

1091

Lastpage

1102

Abstract

In this paper, we will examine the problem of dimensionality reduction of massive disk-resident data sets. Graph mining has become important in recent years because of its numerous applications in community detection, social networking, and web mining. Many graph data sets are defined on massive node domains in which the number of nodes in the underlying domain is very large. As a result, it is often difficult to store and hold the information necessary in order to retrieve and index the data. Most known methods for dimensionality reduction are effective only for data sets defined on modest domains. Furthermore, while the problem of dimensionality reduction is most relevant to the problem of massive data sets, these algorithms are inherently not designed for the case of disk-resident data in terms of the order in which the data is accessed on disk. This is a serious limitation which restricts the applicability of current dimensionality reduction methods. Furthermore, since dimensionality reduction methods are typically designed for database applications such as indexing, it is important to design the underlying data reduction method, so that it can be effectively used for such applications. In this paper, we will examine the difficult problem of dimensionality reduction of graph data in the difficult case in which the underlying number of nodes are very large and the data set is disk-resident. We will propose an effective sampling algorithm for dimensionality reduction and show how to perform the dimensionality reduction in a limited number of passes on disk. We will also design the technique to be highly interpretable and friendly for indexing applications. We will illustrate the effectiveness and efficiency of the approach on a number of real data sets.

Keywords

data mining; graph theory; data reduction method; dimensionality reduction method; disk-resident data sets; graph indexing; graph mining; graph retrieval; Algorithm design and analysis; Bridges; Computational complexity; Data mining; Indexing; Partitioning algorithms; Social network services;

fLanguage

English

Publisher

ieee

Conference_Titel

Data Engineering (ICDE), 2011 IEEE 27th International Conference on

Conference_Location

Hannover

ISSN

1063-6382

Print_ISBN

978-1-4244-8959-6

Electronic_ISBN

1063-6382

Type

conf

DOI

10.1109/ICDE.2011.5767834

Filename

5767834