Regional Information Center for Science and Technology - Separation of data on the training and test set for modelling: a case study for modelling of five colour properties of a white pigment

Abstract:

To evaluate how the choice of training data influences the prediction ability of linear and nonlinear models, various methods for sample selection were tested. The study was carried out by modelling five colour properties of a titanium dioxide white pigment: whiteness (W10), lightness (L* and Lp*), and hue (b* and bp*). The same set of 132 white pigment samples, produced over a 6-month period, was used in all variations of data selection and modelling. Four modelling techniques were used: standard multiple linear regression (MLR), a radial basis function (RBF) model, and two artificial neural network (ANN) learning strategies, error backpropagation (EBP ANN) and counterpropagation (CP ANN). For each of the four modelling techniques, four different sample selections were made using the following methods: time-equidistant sampling of the pigments produced during the 6-month period (time dependent for short), random selection (RS), sampling from Kohonen self-organised top-maps (KOH), and the Kennard-Stone maximal distance approach. Each time, exactly 66 samples were chosen for training and 66 for testing. The 66 testing objects were further divided into a test set and a control set. Only 13 objects were present in all four testing sets. These 13 objects were set aside as the final control set, while the remaining 53 objects from each division were used to test the generated models at the very end of the model-generation part of the work. Each sample (white pigment) in the study is characterised by 17 independent and five dependent variables. The best 80 models (five pigment properties, each modelled by four different modelling methods, each of which was trained on the sets of objects obtained by four different division methods) were tested and the results are reported.
The differences in prediction ability between models obtained by different modelling techniques were found to be statistically significant (at α = 0.05), while the differences due to the division method were not. Error backpropagation was established as the best modelling method. However, several exceptions to these general observations are present and are discussed.
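As an illustration of one of the four division methods named in the abstract, the following is a minimal sketch of a Kennard-Stone maximal distance selection. It is not the authors' implementation. It assumes plain Euclidean distance on unscaled variables, and the function name `kennard_stone` and the randomly generated 132 × 17 data matrix are hypothetical placeholders, not the pigment data.

```python
import math
import random


def kennard_stone(X, n_train):
    """Return indices of n_train rows of X picked by Kennard-Stone selection.

    Assumption: Euclidean distance on the raw variables (no scaling).
    """
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and len(X)")

    # Full pairwise distance matrix (fine for a few hundred samples).
    dist = [[math.dist(a, b) for b in X] for a in X]

    # Seed the selection with the two most distant samples.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: dist[p[0]][p[1]],
    )
    selected = [i0, j0]
    remaining = set(range(n)) - {i0, j0}

    # For each unselected sample, distance to its nearest selected sample.
    min_d = {k: min(dist[k][i0], dist[k][j0]) for k in remaining}

    while len(selected) < n_train:
        # Pick the sample farthest from the selected set;
        # ties are broken in favour of the lowest index.
        nxt = max(remaining, key=lambda k: (min_d[k], -k))
        selected.append(nxt)
        remaining.remove(nxt)
        del min_d[nxt]
        for k in remaining:
            min_d[k] = min(min_d[k], dist[k][nxt])

    return selected


# Hypothetical stand-in data: 132 samples, 17 independent variables.
rng = random.Random(0)
X = [[rng.random() for _ in range(17)] for _ in range(132)]

train_idx = kennard_stone(X, 66)
test_idx = sorted(set(range(len(X))) - set(train_idx))
```

Under these assumptions, `train_idx` holds the 66 samples that span the variable space most evenly and `test_idx` holds the other 66. In the study, the testing sets from the four division methods were then intersected to obtain the 13-object control set.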