  • DocumentCode
    1559
  • Title
    Information-Theoretic Outlier Detection for Large-Scale Categorical Data
  • Author
    Shu Wu ; Shengrui Wang
  • Author_Institution
    Nat. Lab. of Pattern Recognition (NLPR), Inst. of Autom., Beijing, China
  • Volume
    25
  • Issue
    3
  • fYear
    2013
  • fDate
    March 2013
  • Firstpage
    589
  • Lastpage
    602
  • Abstract
    Outlier detection can usually be considered as a pre-processing step for locating, in a data set, those objects that do not conform to well-defined notions of expected behavior. It is very important in data mining for discovering novel or rare events, anomalies, vicious actions, exceptional phenomena, etc. We are investigating outlier detection for categorical data sets. This problem is especially challenging because of the difficulty of defining a meaningful similarity measure for categorical data. In this paper, we propose a formal definition of outliers and an optimization model of outlier detection, via a new concept of holoentropy that takes both entropy and total correlation into consideration. Based on this model, we define a function for the outlier factor of an object which is solely determined by the object itself and can be updated efficiently. We propose two practical 1-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined parameters for deciding whether an object is an outlier. Users need only provide the number of outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective and efficient than mainstream methods and can be used to deal with both large and high-dimensional data sets where existing algorithms fail.
  • Keywords
    data mining; entropy; optimisation; 1-parameter outlier detection method; ITB-SP; ITB-SS; categorical data sets; high-dimensional data sets; holoentropy; information-theoretic outlier detection; large-scale categorical data; optimization model; outlier factor; preprocessing step; similarity measure; total correlation; Complexity theory; Greedy algorithms; Information retrieval; Mutual information; Search methods; Outlier detection; attribute weighting
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2011.261
  • Filename
    6109256