Robust and scalable graph-based semisupervised learning
Abstract
Graph-based semisupervised learning (GSSL) provides a promising paradigm for modeling the manifold structures that may exist in massive data sources in high-dimensional spaces. It has been shown effective in propagating a limited amount of initial labels to a large amount of unlabeled data, matching the needs of many emerging applications such as image annotation and information retrieval. In this paper, we provide reviews of several classical GSSL methods and a few promising methods in handling challenging issues often encountered in web-scale applications. First, to successfully incorporate the contaminated noisy labels associated with web data, label diagnosis and tuning techniques applied to GSSL are surveyed. Second, to support scalability to the gigantic scale (millions or billions of samples), recent solutions based on anchor graphs are reviewed. To help researchers pursue new ideas in this area, we also summarize a few popular data sets and software tools publicly available. Important open issues are discussed at the end to stimulate future research. © 2012 IEEE.