Abstract:
Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, to an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bioinformatics, and graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing and co-clustering. We develop DisCo using Hadoop, an open-source Map-Reduce implementation. We show that DisCo scales well and can efficiently process and analyze extremely large datasets (up to several hundred gigabytes) on commodity hardware.
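To make the Map-Reduce framing of co-clustering concrete, the following minimal sketch (ours, not the DisCo code from the paper) shows how one aggregation pass of a co-clustering iteration maps onto the paradigm: given fixed row and column cluster assignments, the map step emits a (row group, column group) key per nonzero matrix entry, and the reduce step sums per block, yielding the group statistics that drive reassignment. All names are illustrative, and the in-process generator stands in for what a Hadoop job would distribute across machines.

    # Single-process sketch of one map/reduce aggregation pass for
    # co-clustering. Illustrative only; not the authors' DisCo/Hadoop code.
    from collections import defaultdict

    def map_entries(matrix, row_assign, col_assign):
        """Map: emit ((row group, column group), value) per nonzero entry."""
        for i, row in enumerate(matrix):
            for j, v in enumerate(row):
                if v:
                    yield (row_assign[i], col_assign[j]), v

    def reduce_sums(pairs):
        """Reduce: sum values per (row group, column group) key, giving the
        per-block statistics each co-clustering iteration needs."""
        block = defaultdict(int)
        for key, v in pairs:
            block[key] += v
        return dict(block)

    if __name__ == "__main__":
        matrix = [[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1]]
        row_assign = [0, 0, 1]      # rows 0,1 -> group 0; row 2 -> group 1
        col_assign = [0, 0, 1, 1]   # cols 0,1 -> group 0; cols 2,3 -> group 1
        print(reduce_sums(map_entries(matrix, row_assign, col_assign)))
        # prints {(0, 0): 3, (1, 1): 2}

In an actual Hadoop job, map_entries and reduce_sums would correspond to the map and reduce phases, with the framework handling partitioning and shuffling between them; here they run in-process only to show the data flow.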
Keywords:
data mining; distributed processing; pattern clustering; storage management; co-clustering; distributed co-clustering; DisCo; Hadoop; MapReduce; bioinformatics; collaborative filtering; text mining; graph mining; data storage; distributed data pre-processing; execution engine; petabyte-scale end-to-end mining