• DocumentCode
    3165491
  • Title

    Data Discretization Unification

  • Author

    Jin, Ruoming ; Breitbart, Yuri ; Muoh, Chibuike

  • Author_Institution
    Kent State Univ., Kent
  • fYear
    2007
  • fDate
    28-31 Oct. 2007
  • Firstpage
    183
  • Lastpage
    192
  • Abstract
    Data discretization is defined as the process of converting continuous data attribute values into a finite set of intervals with minimal loss of information. In this paper, we prove that discretization methods based on information-theoretic complexity and methods based on statistical measures of data dependency are asymptotically equivalent. Furthermore, we define a notion of generalized entropy and prove that discretization methods based on MDLP, Gini Index, AIC, BIC, and Pearson's χ² and G² statistics are all derivable from the generalized entropy function. We design a dynamic programming algorithm that guarantees the best discretization based on the generalized entropy notion. We also conducted an extensive performance evaluation of our method on several publicly available data sets. Our results show that our method yields, on average, 31% fewer classification errors than many previously known discretization methods.
  • Keywords
    data mining; dynamic programming; continuous data attribute values; data dependency; data discretization unification; dynamic programming algorithm; generalized entropy function; information minimal loss; informational theoretical complexity; Algorithm design and analysis; Association rules; Bayesian methods; Computer science; Data mining; Dynamic programming; Entropy; Error analysis; Heuristic algorithms; Statistics
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
  • Conference_Location
    Omaha, NE
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3018-5
  • Type

    conf

  • DOI
    10.1109/ICDM.2007.35
  • Filename
    4470242