DocumentCode :
1984546
Title :
A Distributed Methodology for Imbalanced Classification Problems
Author :
Lemnaru, Camelia ; Cuibus, Mihai ; Bona, Adrian ; Alic, Andrei ; Potolea, Rodica
Author_Institution :
Comput. Sci. Dept., Tech. Univ. of Cluj-Napoca, Cluj-Napoca, Romania
fYear :
2012
fDate :
25-29 June 2012
Firstpage :
164
Lastpage :
171
Abstract :
Current important challenges in data mining research are triggered by the need to address various particularities of real-world problems, such as imbalanced data and error cost distributions. This paper presents Distributed Evolutionary Cost-Sensitive Balancing, a distributed methodology for dealing with imbalanced data and, if necessary, cost distributions. The method employs a genetic algorithm to search for an optimal cost matrix and base classifier settings, which are then used by a cost-sensitive classifier wrapped around the base classifier. Individual fitness computation is the most intensive task in the algorithm, but it also offers high parallelization potential. Two parallelization alternatives have been explored: a computation-driven approach and a data-driven approach. Both have been developed within the Apache Watchmaker framework and deployed on Hadoop-based infrastructures. Experimental evaluations performed so far indicate that the computation-driven approach achieves good classification performance but does not reduce the running time significantly, whereas the data-driven approach reduces the running time for slow algorithms, such as kNN and SVM, while still yielding important performance improvements.
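The core idea in the abstract (a genetic algorithm that searches for misclassification costs and base classifier settings, with fitness evaluations run in parallel) can be illustrated with a minimal single-machine sketch. Everything below is an assumption made for illustration and is not taken from the paper: the synthetic 1-D data, the kNN base classifier, the balanced-accuracy fitness, the mutation scheme, and all function names. The thread pool only stands in for distributed evaluation; it does not reproduce the Watchmaker/Hadoop deployment or either of the two parallelization designs.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def make_data(rng, n_neg=200, n_pos=20):
    """Synthetic 1-D imbalanced data (assumed): negatives near 0, positives near 1.5."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_neg)]
    data += [(rng.gauss(1.5, 1.0), 1) for _ in range(n_pos)]
    rng.shuffle(data)
    return data


def knn_positive_rate(train, x, k):
    """Base classifier: fraction of positives among the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(label for _, label in nearest) / k


def cost_sensitive_predict(train, x, k, cost_fn, cost_fp=1.0):
    """Wrapper: pick the class with the lower expected misclassification cost."""
    p = knn_positive_rate(train, x, k)
    return 1 if p * cost_fn > (1.0 - p) * cost_fp else 0


def fitness(individual, train, valid):
    """Balanced accuracy on a validation split (an assumed fitness measure)."""
    cost_fn, k = individual
    tp = tn = pos = neg = 0
    for x, y in valid:
        pred = cost_sensitive_predict(train, x, k, cost_fn)
        if y == 1:
            pos += 1
            tp += pred
        else:
            neg += 1
            tn += 1 - pred
    return 0.5 * (tp / pos + tn / neg)


def mutate(individual, rng):
    """Perturb the false-negative cost and the neighbourhood size k."""
    cost_fn, k = individual
    cost_fn = min(50.0, max(1.0, cost_fn * rng.uniform(0.7, 1.4)))
    k = min(25, max(1, k + rng.choice([-2, 0, 2])))
    return (cost_fn, k)


def evolve(train, valid, rng, pop_size=12, generations=6, workers=4):
    """Evolve (cost_fn, k) pairs; fitness of each individual is evaluated via a pool."""
    # Seed with the cost-neutral setting so the search never ends below that baseline.
    population = [(1.0, 5)] + [
        (rng.uniform(1.0, 50.0), rng.choice([1, 3, 5, 7, 9]))
        for _ in range(pop_size - 1)
    ]
    best = None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            scores = list(pool.map(lambda ind: fitness(ind, train, valid), population))
            ranked = sorted(zip(scores, population), reverse=True)
            if best is None or ranked[0][0] > best[0]:
                best = ranked[0]
            # Keep the better half unchanged and refill with mutated copies.
            parents = [ind for _, ind in ranked[: pop_size // 2]]
            children = [mutate(rng.choice(parents), rng)
                        for _ in range(pop_size - len(parents))]
            population = parents + children
    return best  # (best_fitness, (cost_fn, k))


if __name__ == "__main__":
    rng = random.Random(0)
    train, valid = make_data(rng), make_data(rng)
    print("cost-neutral baseline:", fitness((1.0, 5), train, valid))
    print("best found:", evolve(train, valid, rng))
```

Because each individual's fitness depends only on its own parameters and the shared data, the `pool.map` call is the step that a distributed setup would farm out to workers; in CPython the thread pool gives no real speedup for this CPU-bound work and is used here only to show where the parallel boundary sits.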
Keywords :
data mining; genetic algorithms; matrix algebra; parallel processing; pattern classification; Apache Watchmaker framework; Hadoop-based infrastructures; SVM; base classifier settings; computation-driven approach; cost-sensitive classifier; data mining research; data-driven approach; distributed evolutionary cost-sensitive balancing; distributed methodology; error cost distributions; genetic algorithm; imbalanced classification problems; imbalanced data; individual fitness computation; kNN; optimal cost matrix; parallelization alternatives; Classification algorithms; Complexity theory; Context; Genetic algorithms; Sociology; Statistics; Training; Hadoop; distributed cost-sensitive balancing; imbalanced classification;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Computing (ISPDC), 2012 11th International Symposium on
Conference_Location :
Munich/Garching, Bavaria
Print_ISBN :
978-1-4673-2599-8
Type :
conf
DOI :
10.1109/ISPDC.2012.30
Filename :
6341508