DocumentCode :
3394424
Title :
Very large scale ReliefF for genome-wide association analysis
Author :
Eppstein, Margaret J. ; Haake, Paul
Author_Institution :
Comput. Sci. Dept., Univ. of Vermont, Burlington, VT
fYear :
2008
fDate :
15-17 Sept. 2008
Firstpage :
112
Lastpage :
119
Abstract :
Most common diseases are the result of complex nonlinear interactions between multiple genetic and environmental components. There is thus a pressing need for new computational methods capable of detecting nonlinearly interacting single nucleotide polymorphisms (SNPs) that are associated with disease, from amidst up to hundreds of thousands of candidate SNPs. Recently, some progress has been made using feature selection algorithms based on weights from the ReliefF data mining algorithm on sets of up to 1500 SNPs. However, the accuracy of ReliefF does not scale up to the sizes needed for truly large genome-scale SNP association studies. We propose a population-based variant dubbed VLSReliefF, which mitigates this performance drop by stochastically applying ReliefF to SNP subsets, and then assigning each SNP the maximum ReliefF weight it achieved in any subset. A heuristic method is proposed for determining the optimal subset size as a function of heritability, sample size, and order of interactions. Aggressive iterative pruning of SNPs with low VLSReliefF weights can be used for nonlinear feature identification in genome-scale SNP sets. The method is validated using a variety of computational experiments on synthetic datasets of up to 100,000 SNPs.
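The subset-and-maximum idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the parameters `subset_size`, `n_subsets`, `k`, and `seed`, the use of uniformly random subsets, and the simplified two-class ReliefF scoring (Hamming distance, k nearest hits and misses) are all assumptions made for the example. The paper's heuristic for choosing the subset size and its iterative pruning step are not reproduced here.

```python
import numpy as np

def relieff_weights(X, y, k=5):
    """Simplified two-class ReliefF for categorical features (assumed scoring).

    X: (n_samples, n_features) integer genotype codes, y: (n_samples,) class labels.
    Assumes every class contains at least two samples.
    """
    n, p = X.shape
    w = np.zeros(p)
    for i in range(n):
        diff = (X != X[i]).astype(float)   # per-feature mismatch with sample i
        dist = diff.sum(axis=1)            # Hamming distance to sample i
        same = (y == y[i])
        miss_idx = np.flatnonzero(~same)   # samples of the other class
        same[i] = False                    # exclude sample i from its own hits
        hit_idx = np.flatnonzero(same)
        hits = hit_idx[np.argsort(dist[hit_idx])[:k]]
        misses = miss_idx[np.argsort(dist[miss_idx])[:k]]
        # Reward features that differ on nearest misses, penalize on nearest hits.
        w += diff[misses].mean(axis=0) - diff[hits].mean(axis=0)
    return w / n

def vls_relieff_weights(X, y, subset_size, n_subsets, k=5, seed=0):
    """Run ReliefF on random feature subsets; keep each feature's maximum weight.

    Uniform sampling without replacement within a subset is an assumption.
    Features that are never sampled keep a weight of -inf.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    best = np.full(p, -np.inf)
    for _ in range(n_subsets):
        cols = rng.choice(p, size=subset_size, replace=False)
        w = relieff_weights(X[:, cols], y, k)
        best[cols] = np.maximum(best[cols], w)
    return best
```

In use, one would call `vls_relieff_weights(X, y, subset_size, n_subsets)` on a genotype matrix and rank SNPs by the returned weights; `n_subsets` should be large enough that every SNP is likely to appear in at least one subset.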
Keywords :
data mining; diseases; feature extraction; genomics; medical information systems; molecular biophysics; polymorphism; very large databases; aggressive iterative pruning; complex nonlinear interactions; data mining algorithm; diseases; environmental components; feature selection algorithms; genome scale SNP sets; genome-wide SNP association analysis; multiple genetic components; nonlinear feature identification; optimal subset size; population-based variant dubbed VLSReliefF; single nucleotide polymorphism; very-large scale ReliefF; Bioinformatics; Cardiac disease; Cardiovascular diseases; Data mining; Genetics; Genomics; Humans; Large-scale systems; Machine learning; Pressing
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Intelligence in Bioinformatics and Computational Biology, 2008. CIBCB '08. IEEE Symposium on
Conference_Location :
Sun Valley, ID
Print_ISBN :
978-1-4244-1778-0
Electronic_ISBN :
978-1-4244-1779-7
Type :
conf
DOI :
10.1109/CIBCB.2008.4675767
Filename :
4675767