State space models, such as Kalman filters and particle filters, have been applied to improve the accuracy of radio-wave-based localization. However, these models can drift radically when their assumptions are violated, and they have no mechanism to correct the resulting errors. We therefore propose a supervised learning approach to pedestrian localization based on the Inference Machines framework. During training, we collect ground-truth locations using computer vision alongside Bluetooth signals, and use both to train a state space model for localization that can recover from model drift. During testing, our approach uses only Bluetooth signals. Our experimental results show that the approach improves the accuracy of Bluetooth-based localization with only a small number of training examples. Moreover, our multi-modal supervision can also be used to estimate additional parameters, such as device rotation, even though Bluetooth signals do not contain such information.