Predictive modeling with heterogeneous sources

Xiaoxiao Shi; Qi Liu; Wei Fan; Qiang Yang; Philip S. Yu

doi:10.1137/1.9781611972801.71

SDM 2010

Conference paper

29 Apr 2010

Abstract

Lack of labeled training examples is a common problem in many applications. At the same time, there is often an abundance of labeled data from related tasks, although these data have different distributions and outputs (e.g., different class labels or different scales of regression values). In the medical domain, for example, we may have a limited number of vaccine efficacy examples against a new swine flu H1N1 epidemic, whereas a large amount of labeled vaccine data exists from previous years' flu. However, it is difficult to directly apply the older flu vaccine data as training examples because the data distribution and the efficacy output criteria differ between viruses. To increase the sources of labeled data, we propose a method that utilizes examples whose marginal distribution and output criteria can differ from those of the target task. The idea is to first select a subset of source examples similar in distribution to the target data; all the selected instances are then "re-scaled" and assigned new output values from the label space of the target task. A new predictive model is built on the enlarged training set. We derive a generalization bound that specifically considers distribution difference and further evaluate the model on a number of applications. For an siRNA efficacy prediction problem, we extract examples from 4 heterogeneous regression tasks and 2 classification tasks to learn the target model, and achieve an average improvement of 30% in accuracy. Copyright © by SIAM.
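The pipeline outlined in the abstract (select source examples close to the target data, re-scale their outputs into the target's output range, then train on the enlarged set) can be illustrated with a minimal sketch. This is not the authors' algorithm. The nearest-target-distance selection rule, the linear min-max re-scaling, and all function names below are assumptions chosen only to make the three steps concrete.

```python
import math

def select_similar(source_X, target_X, k):
    """Return indices of the k source points nearest to any target point.

    Assumption: Euclidean distance to the closest target point stands in
    for "similar in distribution"; the paper's criterion may differ.
    """
    def dist_to_target(x):
        return min(math.dist(x, t) for t in target_X)
    ranked = sorted(range(len(source_X)),
                    key=lambda i: dist_to_target(source_X[i]))
    return ranked[:k]

def rescale_outputs(source_y, target_y):
    """Linearly map source outputs onto the observed target output range.

    Assumption: a min-max mapping stands in for the "re-scaling" step.
    """
    s_lo, s_hi = min(source_y), max(source_y)
    t_lo, t_hi = min(target_y), max(target_y)
    if s_hi == s_lo:
        # Constant source outputs: place them at the middle of the target range.
        return [(t_lo + t_hi) / 2 for _ in source_y]
    scale = (t_hi - t_lo) / (s_hi - s_lo)
    return [t_lo + (y - s_lo) * scale for y in source_y]

def build_enlarged_training_set(source_X, source_y, target_X, target_y, k):
    """Combine target examples with k selected, re-scaled source examples."""
    rescaled = rescale_outputs(source_y, target_y)
    keep = select_similar(source_X, target_X, k)
    X = list(target_X) + [source_X[i] for i in keep]
    y = list(target_y) + [rescaled[i] for i in keep]
    return X, y

# Toy data (made up for illustration): source outputs lie on a 0-100 scale,
# target outputs on a 0-1 scale.
source_X = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
source_y = [0.0, 50.0, 100.0]
target_X = [(0.5, 0.5), (1.5, 1.5)]
target_y = [0.0, 1.0]

X, y = build_enlarged_training_set(source_X, source_y, target_X, target_y, k=2)
```

The enlarged set `(X, y)` could then be passed to any standard regressor or classifier. The distant source point `(10.0, 10.0)` is dropped by the selection step, and the retained source outputs are expressed on the target's scale.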
