Abstract
The problem of clustering has been widely studied by the data mining community because of its applications to a wide variety of problems in customer segmentation, electronic commerce, and learning. Clustering is usually formulated as the task of grouping individual data records. In many applications, however, we have a collection of multiple sets of records. Each such set is essentially a database of records, and the databases may contain different numbers of records. It is desirable to cluster these sets on the basis of the similarity of their underlying data distributions. This problem may therefore be understood as that of clustering a collection of data sets, as opposed to clustering a collection of instances. The problem is especially challenging when the data sets are not all available at one time but arrive as out-of-order, mixed streams, in which the records from different data sets are interleaved with one another in no particular order. In this paper, we present a first approach to the problem based on anchor-based summarization. We present experimental results on the effectiveness and efficiency of the approach on a number of real data sets. Copyright © 2012 by the Society for Industrial and Applied Mathematics.