• DocumentCode
    3394424
  • Title

    Very large scale ReliefF for genome-wide association analysis

  • Author

    Eppstein, Margaret J. ; Haake, Paul

  • Author_Institution
    Comput. Sci. Dept., Univ. of Vermont, Burlington, VT
  • fYear
    2008
  • fDate
    15-17 Sept. 2008
  • Firstpage
    112
  • Lastpage
    119
  • Abstract
    Most common diseases are the result of complex nonlinear interactions between multiple genetic and environmental components. There is thus a pressing need for new computational methods capable of detecting nonlinearly interacting single nucleotide polymorphisms (SNPs) that are associated with disease, from among up to hundreds of thousands of candidate SNPs. Recently, some progress has been made using feature selection algorithms based on weights from the ReliefF data mining algorithm on sets of up to 1500 SNPs. However, the accuracy of ReliefF does not scale up to the sizes needed for truly large genome-scale SNP association studies. We propose a population-based variant dubbed VLSReliefF, which mitigates this performance drop by stochastically applying ReliefF to SNP subsets, and then assigning each SNP the maximum ReliefF weight it achieved in any subset. A heuristic method is proposed for determining the optimal subset size as a function of heritability, sample size, and order of interactions. Aggressive iterative pruning of SNPs with low VLSReliefF weights can be used for nonlinear feature identification in genome-scale SNP sets. The method is validated using a variety of computational experiments on synthetic datasets of up to 100,000 SNPs.
  • Keywords
    data mining; diseases; feature extraction; genomics; medical information systems; molecular biophysics; polymorphism; very large databases; aggressive iterative pruning; complex nonlinear interactions; data mining algorithm; environmental components; feature selection algorithms; genome scale SNP sets; genome-wide SNP association analysis; multiple genetic components; nonlinear feature identification; optimal subset size; population-based variant dubbed VLSReliefF; single nucleotide polymorphism; very-large scale ReliefF; Bioinformatics; Cardiac disease; Cardiovascular diseases; Data mining; Genetics; Genomics; Humans; Large-scale systems; Machine learning; Pressing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence in Bioinformatics and Computational Biology, 2008. CIBCB '08. IEEE Symposium on
  • Conference_Location
    Sun Valley, ID
  • Print_ISBN
    978-1-4244-1778-0
  • Electronic_ISBN
    978-1-4244-1779-7
  • Type

    conf

  • DOI
    10.1109/CIBCB.2008.4675767
  • Filename
    4675767
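
The subset-and-maximum idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `relieff_weights` is a simplified two-class ReliefF for categorical genotypes (coded 0/1/2, no missing data, Hamming distance), and the names `vls_relieff`, `subset_size`, `n_subsets`, and `k` are assumptions made for illustration. The heuristic for choosing the subset size and the iterative pruning step are not reproduced here.

```python
import numpy as np


def relieff_weights(X, y, k=3):
    """Simplified two-class ReliefF for categorical features.

    Assumes every class has at least two samples. Builds an n x n x p
    difference array, so it is only suitable for small subsets.
    """
    n, p = X.shape
    # diff[i, j, f] is True when samples i and j differ on feature f.
    diff = X[:, None, :] != X[None, :, :]
    dist = diff.sum(axis=2).astype(float)  # Hamming distance
    weights = np.zeros(p)
    for i in range(n):
        same = np.flatnonzero(y == y[i])
        same = same[same != i]
        other = np.flatnonzero(y != y[i])
        hits = same[np.argsort(dist[i, same])[:k]]       # nearest same-class
        misses = other[np.argsort(dist[i, other])[:k]]   # nearest other-class
        # Reward features that differ on misses, penalize those that
        # differ on hits.
        weights += diff[i, misses].mean(axis=0) - diff[i, hits].mean(axis=0)
    return weights / n


def vls_relieff(X, y, subset_size, n_subsets, k=3, seed=0):
    """Run ReliefF on random feature subsets; keep each feature's max weight.

    Features that are never sampled keep a weight of -inf.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    best = np.full(p, -np.inf)
    for _ in range(n_subsets):
        idx = rng.choice(p, size=subset_size, replace=False)
        w = relieff_weights(X[:, idx], y, k)
        best[idx] = np.maximum(best[idx], w)
    return best
```

A call such as `vls_relieff(X, y, subset_size=10, n_subsets=40)` on an `(n_samples, n_snps)` integer genotype array `X` with binary labels `y` returns one weight per SNP. In this sketch `n_subsets` should be large enough that every SNP is likely to be drawn at least once.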