Publication
SDM 2009
Conference paper

Parallel pairwise clustering

Abstract

Given the pairwise affinity relations associated with a set of data items, the goal of a clustering algorithm is to automatically partition the data into a small number of homogeneous clusters. However, since the input size is quadratic in the number of data points, existing algorithms are infeasible for many practical applications. Here, we propose a simple strategy to cluster massive data by randomly splitting the original affinity matrix into small, manageable affinity matrices that are clustered independently. Our proposal is most appealing in a parallel computing environment, where at each iteration each worker node clusters a subset of the input data and the results from all workers are then integrated in a master node to create a new clustering partition over the entire data. We demonstrate that this approach yields high-quality clustering partitions for various real-world problems, even though at each iteration only small fractions of the original data matrix are examined and at no point is the entire affinity matrix stored in memory or even computed. Furthermore, we demonstrate that the proposed algorithm has intriguing stochastic convergence properties that provide further insight into the clustering problem.
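
The abstract only sketches the split-and-merge strategy, so the short Python sketch below illustrates one way such a scheme could look. The helper names (rbf_affinity, cluster_subset, parallel_pairwise_clustering), the use of scikit-learn's SpectralClustering as the per-subset pairwise clusterer, and the maximum-overlap label matching used to integrate worker results are all illustrative assumptions, not the algorithm from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import SpectralClustering


def rbf_affinity(points, sigma=1.0):
    """Dense Gaussian affinities for one *small* subset only (never the full data)."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def cluster_subset(points, k):
    """Worker step: cluster one random subset from its local affinity matrix."""
    affinity = rbf_affinity(points)
    model = SpectralClustering(n_clusters=k, affinity="precomputed", random_state=0)
    return model.fit_predict(affinity)


def parallel_pairwise_clustering(X, k, n_subsets=10, n_iters=5, seed=0):
    """Master loop: split the data at random, cluster each subset, merge labels.

    The merge aligns each subset's local cluster ids to the current global
    partition by maximum-overlap (Hungarian) matching -- an assumed stand-in
    for the integration step described in the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    labels = rng.integers(0, k, size=n)            # arbitrary initial partition
    for _ in range(n_iters):
        for idx in np.array_split(rng.permutation(n), n_subsets):
            local = cluster_subset(X[idx], k)      # would run on a worker node
            # Overlap between local clusters and the current global labels.
            overlap = np.zeros((k, k), dtype=int)
            for c in range(k):
                overlap[c] = np.bincount(labels[idx[local == c]], minlength=k)
            # One-to-one relabelling maximizing agreement with the global partition.
            rows, cols = linear_sum_assignment(-overlap)
            labels[idx] = cols[local]
    return labels


if __name__ == "__main__":
    # Three well-separated Gaussian blobs as a toy data set.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 1.0, size=(100, 2)) for m in ([0, 0], [6, 6], [0, 6])])
    print(np.bincount(parallel_pairwise_clustering(X, k=3)))
```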
