DocumentCode
2843833
Title
Addressing Data-Complexity for Imbalanced Data-Sets: A Preliminary Study on the Use of Preprocessing for C4.5
Author
Luengo, Julián ; Fernandez, Alicia ; Herrera, Francisco
Author_Institution
Dept. of Comput. Sci. & A.I., Univ. of Granada, Granada, Spain
fYear
2009
fDate
Nov. 30 2009-Dec. 2 2009
Firstpage
523
Lastpage
528
Abstract
In this work we analyse the behaviour of the C4.5 classification method on a set of imbalanced data-sets. We use two data complexity metrics, the "maximum Fisher's discriminant ratio" and the "nonlinearity of 1NN classifier", to analyse the effect of preprocessing (oversampling in this case) as a way to deal with the imbalance problem. To do so, we run C4.5 over a wide range of imbalanced data-sets built from real data and try to extract behaviour patterns from the results. We obtain rules that describe both good and bad behaviours of C4.5 when using the original data-sets (no preprocessing) and when applying preprocessing. These rules allow us to determine the effect of preprocessing, to predict the response of C4.5 to preprocessing from the data-set's complexity metrics before it is applied, and thus to establish when preprocessing would be useful.
Keywords
pattern classification; 1NN classifier; C4.5 classification method; data complexity metrics; imbalanced data sets; maximum Fisher's discriminant ratio; Application software; Classification tree analysis; Computer science; Data mining; Decision trees; Density measurement; Geometry; Intelligent systems; Pattern analysis; Topology; C4.5; Classification; Data complexity; Imbalanced Data-sets; Oversampling
fLanguage
English
Publisher
ieee
Conference_Titel
Intelligent Systems Design and Applications, 2009. ISDA '09. Ninth International Conference on
Conference_Location
Pisa
Print_ISBN
978-1-4244-4735-0
Electronic_ISBN
978-0-7695-3872-3
Type
conf
DOI
10.1109/ISDA.2009.233
Filename
5364953