Regional Information Center for Science and Technology - Some statistical issues related to multiple linear regression modeling of beach bacteria concentrations

Abstract:

As a fast and effective technique, multiple linear regression (MLR) has been widely used to model and predict beach bacteria concentrations. In previous work on this subject, however, several issues were addressed insufficiently or inconsistently: the value and use of interaction terms, serial correlation, the criteria for model selection, and model assessment. The present work shows that serial correlation, which is often present in sequentially observed data records, deserves full attention from the modeler, and that testing and adjustment for the time-series effect should be implemented in a statistically rigorous framework. R² and the Cp statistic are recommended as joint criteria for model selection, whereas using the t-statistics associated with the full model for this purpose is erroneous. During model selection, interaction terms can often help decrease the bias in reduced models, although the resulting improvement in numerical performance may be limited. For assessing a model's predictive capacity, which is different from testing goodness of fit, a comprehensive set of statistics is advocated to allow an objective evaluation of different models. Results obtained from data at Huntington Beach, OH, show that erroneous conclusions could be drawn if only the model R² and the counts of type I and type II errors are considered. In this sense, several previous works deserve further investigation.
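To make two of the quantities named in the abstract concrete, the following is a minimal sketch, not the authors' procedure. It fits an MLR model to synthetic data (the data, the AR(1) coefficient of 0.7, and all function names are assumptions made for illustration), then computes R², Mallows' Cp for a reduced model, and the Durbin-Watson statistic. Durbin-Watson is used here only as one common diagnostic for first-order serial correlation in residuals; the abstract does not say which test the paper applies.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients and residuals."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, y - A @ beta

def r_squared(y, resid):
    """Coefficient of determination of a fitted model."""
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def mallows_cp(resid_sub, p_sub, resid_full, p_full, n):
    """Mallows' Cp; p_sub and p_full count parameters including the intercept."""
    sigma2_full = np.sum(resid_full ** 2) / (n - p_full)
    return np.sum(resid_sub ** 2) / sigma2_full - n + 2 * p_sub

def durbin_watson(resid):
    """Durbin-Watson statistic: near 2 for uncorrelated residuals,
    well below 2 for positive first-order serial correlation."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Synthetic, sequentially observed data (illustrative only, not beach data).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))          # three hypothetical predictors
noise = rng.normal(size=n)
err = np.zeros(n)
for t in range(1, n):                # AR(1) errors with coefficient 0.7
    err[t] = 0.7 * err[t - 1] + noise[t]
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + err   # third predictor is irrelevant

_, resid_full = fit_ols(X, y)            # full model: 4 parameters
_, resid_red = fit_ols(X[:, :2], y)      # reduced model: 3 parameters

r2_full = r_squared(y, resid_full)
cp_full = mallows_cp(resid_full, 4, resid_full, 4, n)   # equals p_full by construction
cp_reduced = mallows_cp(resid_red, 3, resid_full, 4, n)
dw = durbin_watson(resid_full)

print(f"R2 (full) = {r2_full:.3f}")
print(f"Cp (full) = {cp_full:.3f}, Cp (reduced) = {cp_reduced:.3f}")
print(f"Durbin-Watson = {dw:.3f}")
```

In this sketch a reduced model whose Cp is close to its parameter count is a candidate with little bias, and a Durbin-Watson value far below 2 signals that the ordinary least-squares standard errors, and any t-statistics built on them, should not be taken at face value without adjusting for the time-series effect.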