• DocumentCode
    1125998
  • Title

    A discretization algorithm based on a heterogeneity criterion

  • Author

    Liu, Xiaoyan ; Wang, Huaiqing

  • Author_Institution
    Dept. of Inf. Syst., City Univ. of Hong Kong, Kowloon, China
  • Volume
    17
  • Issue
    9
  • fYear
    2005
  • Firstpage
    1166
  • Lastpage
    1173
  • Abstract
    Discretization, as a preprocessing step for data mining, is the process of converting the continuous attributes of a data set into discrete ones so that machine learning algorithms can treat them as nominal features. Discretization methods that use entropy-based criteria form a large class of algorithms. However, as a measure of class homogeneity, entropy cannot always accurately reflect the degree of class homogeneity of an interval. Therefore, in this paper, we propose a new measure of the class heterogeneity of intervals from the viewpoint of class probability itself. Based on this definition of heterogeneity, we present a new criterion for evaluating a discretization scheme and analyze its properties theoretically. We also propose a heuristic method for finding an approximately optimal discretization scheme. Finally, our method is compared, in terms of predictive error rate and tree size, with Ent-MDLC, a representative entropy-based discretization method well known for its good performance. Our method is shown to produce better results than Ent-MDLC, although the improvement is not significant. It can be a good alternative to entropy-based discretization methods.
  • Keywords
    data analysis; data mining; heuristic programming; learning (artificial intelligence); probability; very large databases; Ent-MDLC; class homogeneity; class probability; data preparation; discretization algorithm; entropy-based discretization method; heterogeneity criterion; heuristic method; machine learning algorithm; predictive error rate; Computer Society; Decision trees; Discrete transforms; Entropy; Error analysis; Frequency conversion; Machine learning; Machine learning algorithms; Spatial databases; discretization; heterogeneity
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2005.135
  • Filename
    1490524