In complex visual recognition systems, feature fusion has become crucial for discriminating among a large number of classes. In particular, fusing high-level context information with image appearance models can be effective in object and scene recognition. To this end, we develop an auto-context modeling approach in the reproducing kernel Hilbert space (RKHS) setting, in which a series of supervised learners is used to approximate the context model. By posing the fusion of the context and appearance models as a multiple kernel learning problem, we obtain a computationally tractable solution to this challenging problem. Furthermore, we propose to use the marginal probabilities from a kernel SVM classifier to construct the auto-context kernel. In addition to providing better regularization of the learning problem, our approach leads to improved recognition performance compared with using the image features alone.
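As a rough illustration of the kernel construction described above, the following minimal sketch builds a context kernel from per-sample class-probability vectors and fuses it with an appearance kernel. Everything named here is an assumption made for illustration: the helper functions `rbf_kernel` and `fuse_kernels`, the RBF form of both kernels, the toy data, and the fixed weights. In particular, the probability vectors are hard-coded stand-ins for the output of a kernel SVM, and the fixed convex weights stand in for the weights that multiple kernel learning would estimate.

```python
import math


def rbf_kernel(rows, gamma):
    """Gram matrix with K[i][j] = exp(-gamma * ||rows[i] - rows[j]||^2)."""
    n = len(rows)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
            gram[i][j] = math.exp(-gamma * sq_dist)
    return gram


def fuse_kernels(kernels, weights):
    """Convex combination sum_m weights[m] * kernels[m] of Gram matrices."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: three images, two appearance features each.
appearance_features = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]

# Hypothetical per-image class probabilities (stand-ins for classifier output).
class_probabilities = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]

# Appearance kernel on image features, context kernel on probability vectors.
K_appearance = rbf_kernel(appearance_features, gamma=1.0)
K_context = rbf_kernel(class_probabilities, gamma=1.0)

# Fixed equal weights, used here only in place of learned MKL weights.
K_fused = fuse_kernels([K_appearance, K_context], [0.5, 0.5])
```

A convex combination of positive semidefinite Gram matrices is itself positive semidefinite, so `K_fused` can be passed to any kernel classifier that accepts a precomputed kernel. Repeating the step with probabilities from a classifier trained on the fused kernel would give one possible reading of the "series of supervised learners".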