Regional Information Center for Science and Technology (RICeST) - Very large scale ReliefF for genome-wide association analysis

DocumentCode :

3394424

Title :

Very large scale ReliefF for genome-wide association analysis

Author :

Eppstein, Margaret J. ; Haake, Paul

Author_Institution :

Comput. Sci. Dept., Univ. of Vermont, Burlington, VT

fYear :

2008

fDate :

15-17 Sept. 2008

Firstpage :

112

Lastpage :

119

Abstract :

Most common diseases are the result of complex nonlinear interactions between multiple genetic and environmental components. There is thus a pressing need for new computational methods capable of detecting nonlinearly interacting single nucleotide polymorphisms (SNPs) that are associated with disease, from amidst up to hundreds of thousands of candidate SNPs. Recently, some progress has been made using feature selection algorithms based on weights from the ReliefF data mining algorithm on sets of up to 1500 SNPs. However, the accuracy of ReliefF does not scale up to the sizes needed for truly large genome-scale SNP association studies. We propose a population-based variant dubbed VLSReliefF, which mitigates this performance drop by stochastically applying ReliefF to SNP subsets, and then assigning each SNP the maximum ReliefF weight it achieved in any subset. A heuristic method is proposed for determining the optimal subset size as a function of heritability, sample size, and order of interactions. Aggressive iterative pruning of SNPs with low VLSReliefF weights can be used for nonlinear feature identification in genome-scale SNP sets. The method is validated using a variety of computational experiments on synthetic datasets of up to 100,000 SNPs.
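The subset-and-maximum step described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: `relieff_weights` is a simplified ReliefF (Hamming distance over discrete genotypes, `k` nearest hits and misses, binary class labels), and the function names, `subset_size`, `n_subsets`, and `seed` are all hypothetical. The abstract does not give the heuristic for choosing the subset size, so it is left as a free parameter here.

```python
import random


def relieff_weights(X, y, k=3):
    """Simplified ReliefF (an assumption, not the paper's exact variant).

    X: list of samples, each a list of discrete genotype codes (e.g. 0/1/2).
    y: list of binary class labels. Returns one weight per column of X.
    """
    n, m = len(X), len(X[0])
    w = [0.0] * m
    for i in range(n):
        # Hamming distance from sample i to every other sample.
        dists = sorted(
            (sum(a != b for a, b in zip(X[i], X[j])), j)
            for j in range(n) if j != i
        )
        hits = [j for _, j in dists if y[j] == y[i]][:k]
        misses = [j for _, j in dists if y[j] != y[i]][:k]
        for f in range(m):
            # Penalize features that differ from same-class neighbours.
            if hits:
                w[f] -= sum(X[i][f] != X[j][f] for j in hits) / (len(hits) * n)
            # Reward features that differ from other-class neighbours.
            if misses:
                w[f] += sum(X[i][f] != X[j][f] for j in misses) / (len(misses) * n)
    return w


def vls_relieff(X, y, subset_size, n_subsets, k=3, seed=0):
    """Run ReliefF on random SNP subsets; keep each SNP's maximum weight.

    SNPs that are never sampled keep a weight of -inf.
    """
    rng = random.Random(seed)
    m = len(X[0])
    best = [float("-inf")] * m
    for _ in range(n_subsets):
        cols = rng.sample(range(m), subset_size)
        sub = [[row[c] for c in cols] for row in X]
        for c, wc in zip(cols, relieff_weights(sub, y, k)):
            best[c] = max(best[c], wc)
    return best
```

A call such as `vls_relieff(X, y, subset_size=50, n_subsets=2000)` would return one score per SNP. The iterative pruning mentioned in the abstract could then be approximated by repeatedly dropping the lowest-scoring SNPs and rescoring the remainder, though the pruning schedule is not specified in this record.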

Keywords :

data mining; diseases; feature extraction; genomics; medical information systems; molecular biophysics; polymorphism; very large databases; aggressive iterative pruning; complex nonlinear interactions; data mining algorithm; environmental components; feature selection algorithms; genome scale SNP sets; genome-wide SNP association analysis; multiple genetic components; nonlinear feature identification; optimal subset size; population-based variant dubbed VLSReliefF; single nucleotide polymorphism; very-large scale ReliefF; Bioinformatics; Cardiac disease; Cardiovascular diseases; Data mining; Genetics; Genomics; Humans; Large-scale systems; Machine learning; Pressing

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computational Intelligence in Bioinformatics and Computational Biology, 2008. CIBCB '08. IEEE Symposium on

Conference_Location :

Sun Valley, ID

Print_ISBN :

978-1-4244-1778-0

Electronic_ISBN :

978-1-4244-1779-7

Type :

conf

DOI :

10.1109/CIBCB.2008.4675767

Filename :

4675767

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3394424