DocumentCode :
2111183
Title :
Distributed log information processing with Map-Reduce: A case study from raw data to final models
Author :
Luo, Mingyue ; Liu, Gang
Author_Institution :
Sch. of Electron. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear :
2010
fDate :
17-19 Dec. 2010
Firstpage :
1143
Lastpage :
1146
Abstract :
With the rapid development of the Internet, e-commerce websites now routinely have to work with log datasets of up to a few terabytes in size. Removing messy data promptly and at low cost, and extracting the useful information, is a problem we have to face. The mining process involves several steps, from pre-processing the raw data to establishing the final models. In this paper we describe our method for solving this problem with Map-Reduce. Hadoop is an open-source Map-Reduce implementation for reliable, scalable, distributed computing. Several applications that we have proposed (data extraction, a sum operation, a join operation, and a clustering algorithm) are run on Hadoop. They can be applied to data pre-processing and to detecting users with the same interests. In particular, we focus on clustering algorithms. A clustering algorithm that integrates SOM (Self-Organized Map) and fuzzy logic is combined with Map-Reduce; we call it MRSF here. With the help of a Hadoop cluster, large MRSF jobs can be accommodated easily by just adding more nodes or computers to the cluster. Our experiments show that MRSF scales well and can efficiently process and analyze extremely large datasets.
Keywords :
Internet; Web sites; data analysis; data mining; distributed processing; electronic commerce; fuzzy logic; pattern clustering; public domain software; self-organising feature maps; Hadoop; Internet; clustering algorithm; data extraction; data mining; distributed computing; distributed log information processing; e-commerce Website; fuzzy logic; join operation; log dataset; map-reduce; open-source software; raw data processing; self-organized map; sum operation; user interest detection; Clustering algorithms; Computational modeling; Computers; Data mining; Data models; Distributed databases; Training; Distributed Data Mining; Map-Reduce; data pre-processing; join operation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Theory and Information Security (ICITIS), 2010 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6942-0
Type :
conf
DOI :
10.1109/ICITIS.2010.5689760
Filename :
5689760