• DocumentCode
    3165663
  • Title

    Sample Selection for Maximal Diversity

  • Author

    Feng Pan ; Roberts, A. ; McMillan, L. ; de Villena, F.P.M. ; Threadgill, D. ; Wei Wang

  • Author_Institution
    Univ. of North Carolina, Chapel Hill
  • fYear
    2007
  • fDate
    28-31 Oct. 2007
  • Firstpage
    262
  • Lastpage
    271
  • Abstract
    The problem of selecting a sample subset sufficient to preserve diversity arises in many applications. One example is in the design of recombinant inbred lines (RIL) for genetic association studies. In this context, genetic diversity is measured by how many alleles are retained in the resulting inbred strains. RIL panels that are derived from more than two parental strains, such as the collaborative cross (Churchill et al., 2004), present a particular challenge with regard to which of the many existing lab mouse strains should be included in the initial breeding funnel in order to maximize allele retention. A similar problem occurs in the study of customer reviews when selecting a subset of products with a maximal diversity in reviews. Diversity in this case implies the presence of a set of products having both positive and negative ranks for each customer. In this paper, we demonstrate that selecting an optimal diversity subset is an NP-complete problem via reduction to set cover. This reduction is sufficiently tight that greedy approximations to the set cover problem directly apply to maximizing diversity. We then suggest a slightly modified subset selection problem in which an initial greedy diversity solution is used to effectively prune an exhaustive search for all diversity subsets bounded from below by a specified coverage threshold. Extensive experiments on real datasets are performed to demonstrate the effectiveness and efficiency of our approach.
  • Keywords
    biology computing; computational complexity; data mining; genetics; greedy algorithms; NP-complete problem; allele retention; breeding funnel; genetic association studies; genetic diversity; greedy approximations; greedy diversity; inbred strains; maximal diversity; recombinant inbred lines; sample selection; sample subset selection; Application software; Capacitive sensors; Collaboration; Computer science; Data mining; Discrete wavelet transforms; Frequency; Genetics; Mice; Strain measurement;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
  • Conference_Location
    Omaha, NE
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3018-5
  • Type

    conf

  • DOI
    10.1109/ICDM.2007.16
  • Filename
    4470250