• DocumentCode
    2709583
  • Title

    Border Sampling through Coupling Markov Chain Monte Carlo

  • Author

    Li, Guichong ; Japkowicz, Nathalie ; Stocki, Trevor J. ; Ungar, R. Kurt

  • Author_Institution
    Comput. Sci. of Univ. of Ottawa, Ottawa, ON
  • fYear
    2008
  • fDate
    15-19 Dec. 2008
  • Firstpage
    393
  • Lastpage
    402
  • Abstract
    Recently, progressive border sampling (PBS) was proposed for sample selection in supervised learning by progressively learning an augmented full border from small labeled datasets. However, this quadratic learning algorithm is inapplicable to large datasets. In this paper, we incorporate the PBS to a state of the art technique called coupling Markov chain Monte Carlo (CMCMC) in an attempt to scale the original algorithm up on large labeled datasets. The CMCMC can produce an exact sample while a naive strategy for Markov chain Monte Carlo cannot guarantee the convergence to a stationary distribution. The resulting CMCMC-PBS algorithm is thus proposed for border sampling on large datasets. CMCMC-PBS exhibits several remarkable characteristics: linear time complexity, learner-independence, and a consistent convergence to an optimal sample from the original training sets by learning from their subsamples. Our experimental results on the 33 either small or large labeled datasets from the UCIKDD repository and a nuclear security application show that our new approach outperforms many previous sampling techniques for sample selection.
  • Keywords
    Markov processes; Monte Carlo methods; computational complexity; convergence; learning (artificial intelligence); sampling methods; convergence; coupling Markov chain Monte Carlo; labeled dataset; learner-independence; linear time complexity; original training set; progressive border sampling; quadratic learning algorithm; sample selection; sampling technique; stationary distribution; supervised learning; Computer science; Convergence; Costs; Data mining; Data security; Machine learning; Monte Carlo methods; Protection; Sampling methods; Supervised learning; border identification; markov chain monte carlo; sample selection;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on
  • Conference_Location
    Pisa
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3502-9
  • Type

    conf

  • DOI
    10.1109/ICDM.2008.52
  • Filename
    4781134