Regional Information Center for Science and Technology - A linear algebra measure of cluster quality

Abstract:

One of the most common models in information retrieval (IR), the vector space model, represents a document set as a term-document matrix where each row corresponds to a term and each column corresponds to a document. Because of the use of matrices in IR, it is possible to apply linear algebra to this IR model. This paper describes an application of linear algebra to text clustering, namely, a metric for measuring cluster quality. The metric is based on the theory that cluster quality is proportional to the number of terms that are disjoint across the clusters. The metric compares the singular values of the term-document matrix to the singular values of the matrices for each of the clusters to determine the amount of overlap of the terms across clusters. Because the metric can be difficult to interpret, a standardization of the metric is defined, which specifies the number of standard deviations a clustering of a document set is from an average, random clustering of that document set. Empirical evidence shows that the standardized cluster metric correlates with clustered retrieval performance when comparing clustering algorithms or multiple parameters for the same clustering algorithm.
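The abstract does not give the metric's formula, so the following is only a minimal sketch of the idea under stated assumptions, not the paper's definition. It relies on one linear algebra fact: if the clusters share no terms, the term-document matrix is block diagonal up to a permutation, and its nonzero singular values are exactly the pooled nonzero singular values of the per-cluster matrices. The function names (`overlap_score`, `standardized_score`), the Euclidean distance between the sorted spectra, the size-preserving label shuffle used to model a "random clustering", and the sign convention (lower means less term overlap) are all assumptions made for illustration.

```python
import numpy as np

def pooled_cluster_singular_values(A, labels):
    """Singular values of each cluster's column submatrix, pooled and sorted descending."""
    labels = np.asarray(labels)
    pooled = [np.linalg.svd(A[:, labels == c], compute_uv=False)
              for c in np.unique(labels)]
    return np.sort(np.concatenate(pooled))[::-1]

def overlap_score(A, labels):
    """Hypothetical overlap measure: distance between the spectrum of the full
    term-document matrix and the pooled spectra of the cluster matrices.
    It is (numerically) zero when the clusters use disjoint sets of terms."""
    full = np.linalg.svd(A, compute_uv=False)  # already sorted descending
    pooled = pooled_cluster_singular_values(A, labels)
    n = max(len(full), len(pooled))
    # Zero-pad so both spectra have the same length before comparing.
    full = np.pad(full, (0, n - len(full)))
    pooled = np.pad(pooled, (0, n - len(pooled)))
    return float(np.linalg.norm(full - pooled))

def standardized_score(A, labels, trials=200, seed=0):
    """Number of standard deviations the clustering's score lies from the mean
    score of random clusterings (assumed here: shuffled labels, same cluster sizes)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    random_scores = np.array([overlap_score(A, rng.permutation(labels))
                              for _ in range(trials)])
    std = random_scores.std()
    if std == 0:
        return 0.0  # every random clustering scored the same; no spread to scale by
    return float((overlap_score(A, labels) - random_scores.mean()) / std)

# Toy term-document matrix: rows are terms, columns are documents.
# Documents 0 and 1 use only terms 0 and 1; documents 2 and 3 use only terms 2 and 3.
A = np.array([[1.0, 2.0, 0.0, 0.0],
              [3.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 3.0]])
good = np.array([0, 0, 1, 1])  # clusters with disjoint terms
bad = np.array([0, 1, 0, 1])   # clusters that mix the two term groups

good_score = overlap_score(A, good)
bad_score = overlap_score(A, bad)
good_z = standardized_score(A, good)
```

In this sketch the term-disjoint clustering scores essentially zero, the mixed clustering scores higher, and a negative standardized value means the clustering overlaps less than an average random clustering of the same documents. The paper's own metric may use a different comparison of the singular values and a different sign convention.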