Publication
Big Data 2016
Conference paper
Sampling-based Distributed Kernel Mean Matching Using Spark
Abstract
Limited access to supervised information in real-world data mining applications can give rise to scenarios where training and test data are related by a covariate shift, i.e., they share the same class-conditional distribution but have different covariate distributions. Traditional data mining techniques assume that training and test data follow an identical distribution, and therefore suffer in the presence of covariate shift. Kernel Mean Matching (KMM) is a well-known approach that addresses covariate shift by appropriately weighting training instances. However, its time complexity is cubic in the size of the training data, which makes it computationally impractical for large or streaming datasets. In this paper, we present a sampling-based algorithm that addresses the limited scalability of KMM. We further show that the approach is highly parallelizable, and propose a distributed algorithm that efficiently estimates training instance weights using Spark. Experimental results on benchmark datasets show that the proposed approach achieves estimation accuracy competitive with the KMM algorithm at a much lower execution time. They also indicate that a larger training set yields higher accuracy with minimal effect on the execution time of the proposed approach.
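To make the weighting idea concrete, the sketch below illustrates the standard KMM objective (not the paper's distributed algorithm): choose weights β for the training instances so that the weighted mean of the training data in the kernel feature space matches the mean of the test data. Full KMM solves a constrained quadratic program, which is where the cubic cost comes from; this sketch substitutes an unconstrained ridge-regularized linear solve followed by clipping, purely for illustration. All names (`rbf_kernel`, `kmm_weights`, the `gamma` and `ridge` parameters) are our own, not from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row-sets A and B."""
    # pairwise squared Euclidean distances via broadcasting
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kmm_weights(X_tr, X_te, gamma=1.0, ridge=1e-2):
    """Illustrative KMM-style weights: minimize
    || K_tr beta - (n_tr / n_te) * K_{tr,te} 1 ||
    via a regularized linear solve, then clip to nonnegative.
    (Proper KMM adds box and sum constraints and solves a QP.)"""
    n_tr, n_te = len(X_tr), len(X_te)
    K = rbf_kernel(X_tr, X_tr, gamma) + ridge * np.eye(n_tr)
    # kappa_i = (n_tr / n_te) * sum_j k(x_tr_i, x_te_j)
    kappa = (n_tr / n_te) * rbf_kernel(X_tr, X_te, gamma).sum(axis=1)
    beta = np.linalg.solve(K, kappa)
    return np.clip(beta, 0.0, None)

# Training data drawn around 0, test data shifted toward 1: KMM-style
# weights should upweight training points that lie near the test data.
rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, size=(50, 1))
X_te = rng.normal(1.0, 1.0, size=(40, 1))
w = kmm_weights(X_tr, X_te, gamma=0.5)
```

Because `beta` approximates the density ratio between test and training covariate distributions, training points in the test-dense region receive larger weights; a downstream learner trained on the weighted instances then compensates for the shift. The exact solve here is O(n³) in the training size, which is precisely the bottleneck the paper's sampling-based distributed approach targets.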