• DocumentCode
    3013143
  • Title

    An efficient approximation scheme for data mining tasks

  • Author

    Kollios, George ; Gunopulos, D. ; Koudas, Nick ; Berchtold, Stefan

  • Author_Institution
    Boston Univ., MA, USA
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    453
  • Lastpage
    462
  • Abstract
    We investigate the use of biased sampling according to the density of the dataset to speed up the operation of general data mining tasks, such as clustering and outlier detection, in large multidimensional datasets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the dataset. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, evaluate it analytically, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling to real and synthetic data sets and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.
  • Keywords
    data mining; distributed databases; pattern clustering; sampling methods; approximate clustering; approximation scheme; biased sampling; data mining tasks; dataset density; density biased sampling; density-biased sampling; general data mining tasks; large multidimensional datasets; local density; outlier detection; outlier detection algorithms; simple random sampling; synthetic data sets; user requirements; Clustering algorithms; Data analysis; Data mining; Detection algorithms; Engineering profession; Laboratories; Multidimensional systems; Probability distribution; Sampling methods; US Department of Transportation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2001. Proceedings. 17th International Conference on
  • Conference_Location
    Heidelberg
  • ISSN
    1063-6382
  • Print_ISBN
    0-7695-1001-9
  • Type

    conf

  • DOI
    10.1109/ICDE.2001.914858
  • Filename
    914858
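
  • Illustration
    The abstract states that, in density-biased sampling, a point's inclusion probability depends on the local density of the dataset. The sketch below is one possible reading of that idea, not the method from the paper. It assumes an inclusion probability proportional to (local density) ** exponent, and it approximates local density by counting points in fixed-size grid cells. The function name density_biased_sample and the parameters cell_size, exponent, and target_size are hypothetical and do not come from the record.

```python
import random
from collections import Counter


def density_biased_sample(points, cell_size, exponent, target_size, seed=0):
    """Keep each point with probability proportional to density ** exponent.

    Assumptions (illustrative only, not taken from the paper):
    - local density is the number of points sharing the point's grid cell;
    - probabilities are scaled so the expected sample size is about
      target_size, then capped at 1.
    """
    rng = random.Random(seed)
    # Assign every point to the grid cell that contains it.
    cells = [tuple(int(c // cell_size) for c in p) for p in points]
    counts = Counter(cells)
    # Unnormalised weight per point: (cell count) ** exponent.
    weights = [counts[cell] ** exponent for cell in cells]
    total = sum(weights)
    # Scale weights into inclusion probabilities, capped at 1.
    probs = [min(1.0, target_size * w / total) for w in weights]
    # Independent coin flip per point.
    return [p for p, pr in zip(points, probs) if rng.random() < pr]


# One dense cell of 1000 points plus 10 isolated points.
dense = [(i / 1000.0, 0.5) for i in range(1000)]
isolated = [(10.0 * k + 5, 10.0 * k + 5) for k in range(1, 11)]
data = dense + isolated

# A negative exponent favours sparse regions (useful for outlier detection).
outlier_biased = density_biased_sample(data, cell_size=1.0, exponent=-1, target_size=11)
# An exponent of 0 reduces to simple random sampling.
uniform = density_biased_sample(data, cell_size=1.0, exponent=0, target_size=100)
```

    Under these assumptions, a positive exponent oversamples dense regions, which suits approximate clustering, while a negative exponent oversamples sparse regions, which suits outlier detection. A grid is used only to keep the sketch short; any other density estimator could be substituted.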