Title :
On the Computation of Stochastic Search Variable Selection in Linear Regression with UDFs
Author :
Navas, Mario ; Ordonez, Carlos ; Baladandayuthapani, Veerabhadran
Author_Institution :
Dept. of Comput. Sci., Univ. of Houston, Houston, TX, USA
Abstract :
Computing Bayesian statistics with traditional techniques is extremely slow, especially when large data sets have to be exported from a relational DBMS. We propose algorithms for large-scale processing of stochastic search variable selection (SSVS) for linear regression that work entirely inside a DBMS. The traditional SSVS algorithm requires multiple scans of the input data to compute a regression model. With our optimizations, SSVS can be computed either in one scan over the input table, using sufficient statistics, when the number of records is large, or in one scan per iteration when the data is high-dimensional. We consider storage layouts that efficiently exploit DBMS parallel processing of aggregate functions. Experimental results demonstrate the correctness, convergence and performance of our algorithms. Finally, the algorithms scale well for data with a very large number of records or a very high number of dimensions.
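The one-scan idea in the abstract can be illustrated with a minimal sketch. It is not the authors' UDF implementation and it omits the SSVS Gibbs sampler; it only assumes that the sufficient statistic is the Gram matrix Q of the augmented row z = [1, x_1, ..., x_d, y]. The function names (gram_one_scan, solve, fit_subset) are hypothetical. Once Q has been accumulated in a single pass, a least-squares fit for any candidate subset of variables can be evaluated from Q alone, without rescanning the data.

```python
def gram_one_scan(rows, d):
    """Single pass over the data: accumulate Q = sum of z z^T,
    where z = [1, x_1, ..., x_d, y] (assumed layout, for illustration)."""
    m = d + 2
    Q = [[0.0] * m for _ in range(m)]
    n = 0
    for x, y in rows:
        z = [1.0] + list(x) + [float(y)]
        for i in range(m):
            for j in range(m):
                Q[i][j] += z[i] * z[j]
        n += 1
    return n, Q


def solve(A, b):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    beta = [0.0] * k
    for c in reversed(range(k)):
        s = M[c][k] - sum(M[c][j] * beta[j] for j in range(c + 1, k))
        beta[c] = s / M[c][c]
    return beta


def fit_subset(Q, subset):
    """Least-squares fit for intercept plus the chosen variables,
    computed only from Q (no further pass over the data)."""
    idx = [0] + [j + 1 for j in subset]  # index 0 of z is the intercept
    y = len(Q) - 1                       # last index of z is the response
    A = [[Q[i][j] for j in idx] for i in idx]
    b = [Q[i][y] for i in idx]
    beta = solve(A, b)
    # Residual sum of squares: y'y - beta' X'y at the least-squares solution
    rss = Q[y][y] - sum(bi * ci for bi, ci in zip(beta, b))
    return beta, rss


if __name__ == "__main__":
    # Toy data: y depends on x_1 only; x_2 is irrelevant.
    data = [((1, 5), 5), ((2, 3), 8), ((3, 8), 11), ((4, 1), 14)]
    n, Q = gram_one_scan(data, d=2)
    for subset in ([0], [1], [0, 1]):
        print(subset, fit_subset(Q, subset))
```

Because every candidate model is scored from Q, a variable-selection loop over many subsets touches the input table only once. Inside a DBMS the accumulation step would be an aggregate computation rather than a Python loop.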
Keywords :
Bayes methods; data mining; iterative methods; optimisation; regression analysis; relational databases; stochastic processes; Bayesian statistics; SSVS algorithm; UDF; data scanning; linear regression; parallel processing; relational DBMS; stochastic search variable selection; user defined function; variable selection
Conference_Title :
Data Mining (ICDM), 2010 IEEE 10th International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4244-9131-5
Electronic_ISBN :
1550-4786
DOI :
10.1109/ICDM.2010.79