Abstract:
This study describes amended parallel analysis (APA), a novel
method for model selection in unsupervised learning
problems such as information retrieval (IR).
At issue is the selection of k, the number of dimensions
retained under latent semantic indexing (LSI).
Amended parallel analysis is an elaboration of Horn’s
parallel analysis, which advocates retaining eigenvalues
larger than those that we would expect under term independence.
Amended parallel analysis operates by deriving
confidence intervals on these “null” eigenvalues.
The technique amounts to a series of nonparametric hypothesis
tests on the correlation matrix eigenvalues. In
the study, APA is tested along with four established dimensionality
estimators on six standard IR test collections.
These estimates are evaluated with regard to two
IR performance metrics. Additionally, results from simulated
data are reported. In both rounds of experimentation,
APA performs well, predicting the best values of k
on 3 of 12 observations, with good predictions on several
others, and never offering the worst estimate of optimal
dimensionality.
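
The eigenvalue test summarized above can be illustrated with a minimal sketch. The abstract does not say how the confidence intervals on the null eigenvalues are derived, so the sketch rests on labeled assumptions: term independence is simulated by permuting each column of the data matrix independently, the upper limit for each null eigenvalue is taken as an empirical percentile, and k is the number of leading observed eigenvalues that exceed their limit. The names estimate_k, n_null, and upper_pct are hypothetical and do not come from the study.

```python
import numpy as np


def sorted_corr_eigenvalues(X):
    """Eigenvalues of the column correlation matrix of X, largest first."""
    corr = np.corrcoef(X, rowvar=False)
    # eigvalsh returns ascending eigenvalues of a symmetric matrix.
    return np.linalg.eigvalsh(corr)[::-1]


def estimate_k(X, n_null=200, upper_pct=95.0, seed=0):
    """Count leading eigenvalues that exceed an upper limit on null eigenvalues.

    ASSUMPTION: the null distribution is built by permuting every column
    independently, which keeps each column's marginal distribution but
    removes the correlations between columns. The study's actual
    procedure may differ. Columns must not be constant, otherwise the
    correlation matrix is undefined.
    """
    rng = np.random.default_rng(seed)
    n_cols = X.shape[1]
    observed = sorted_corr_eigenvalues(X)

    null = np.empty((n_null, n_cols))
    for b in range(n_null):
        permuted = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_cols)]
        )
        null[b] = sorted_corr_eigenvalues(permuted)

    # ASSUMPTION: the upper confidence limit is an empirical percentile
    # of each ranked null eigenvalue.
    bound = np.percentile(null, upper_pct, axis=0)

    # Retain eigenvalues in rank order until the first one that fails.
    exceeds = observed > bound
    k = n_cols if exceeds.all() else int(np.argmin(exceeds))
    return k, observed, bound
```

Here X is a documents-by-terms array. Raising upper_pct makes each test more conservative and so tends to yield a smaller k.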