Abstract:
Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, to an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bioinformatics, and graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing and co-clustering. We develop DisCo using Hadoop, an open-source Map-Reduce implementation. We show that DisCo scales well and can efficiently process and analyze extremely large datasets (up to several hundred gigabytes) on commodity hardware.
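To make the Map-Reduce framing of co-clustering concrete, the following minimal sketch (ours, not the DisCo code from the paper) shows how one aggregation pass of a co-clustering iteration maps onto the paradigm: given fixed row and column cluster assignments, the map step emits a (row group, column group) key per nonzero matrix entry, and the reduce step sums per block, yielding the group statistics that drive reassignment. All names are illustrative, and the in-process generator stands in for what a Hadoop job would distribute across machines.

    # Single-process sketch of one map/reduce aggregation pass for
    # co-clustering. Illustrative only; not the authors' DisCo/Hadoop code.
    from collections import defaultdict

    def map_entries(matrix, row_assign, col_assign):
        """Map: emit ((row group, column group), value) per nonzero entry."""
        for i, row in enumerate(matrix):
            for j, v in enumerate(row):
                if v:
                    yield (row_assign[i], col_assign[j]), v

    def reduce_sums(pairs):
        """Reduce: sum values per (row group, column group) key, giving the
        per-block statistics each co-clustering iteration needs."""
        block = defaultdict(int)
        for key, v in pairs:
            block[key] += v
        return dict(block)

    if __name__ == "__main__":
        matrix = [[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1]]
        row_assign = [0, 0, 1]      # rows 0,1 -> group 0; row 2 -> group 1
        col_assign = [0, 0, 1, 1]   # cols 0,1 -> group 0; cols 2,3 -> group 1
        print(reduce_sums(map_entries(matrix, row_assign, col_assign)))
        # prints {(0, 0): 3, (1, 1): 2}

In an actual Hadoop job, map_entries and reduce_sums would correspond to the map and reduce phases, with the framework handling partitioning and shuffling between them; here they run in-process only to show the data flow.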
Keywords:
data mining; distributed processing; pattern clustering; storage management; co-clustering; distributed co-clustering; DisCo; Hadoop; MapReduce; bioinformatics; collaborative filtering; text mining; graph mining; data storage; distributed data pre-processing; execution engine; petabyte-scale end-to-end mining