Title :
Correlation clustering
Author :
Bansal, Nikhil ; Blum, Avrim ; Chawla, Shuchi
Author_Institution :
Dept. of Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA
Abstract :
We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, υ) is labeled either + or - depending on whether a and υ have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of - edges between clusters (equivalently, minimizes the number of disagreements: the number of - edges inside clusters plus the number of + edges between clusters). This formulation is motivated from a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of "agnostic learning" problem. An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS. We also show how to extend some of these results to graphs with edge labels in [-1, +1], and give some results for the case of random noise.
Keywords :
approximation theory; probability; randomised algorithms; approximation algorithms; complete graph; correlation clustering; edge labels; min-max clustering; min-sum clustering; pairwise similarity function; Approximation algorithms; Clustering algorithms; Computer science; Databases; Training data;
Conference_Titel :
Foundations of Computer Science, 2002. Proceedings. The 43rd Annual IEEE Symposium on
Print_ISBN :
0-7695-1822-2
DOI :
10.1109/SFCS.2002.1181947