Due to the widespread adoption of wearable devices, detecting motor activities from sensor data has become increasingly common in a wide range of applications. When developing predictive models for activity recognition, data scientists generally rely on performance metrics (such as accuracy) to evaluate and compare classification algorithms. While these numerical estimates offer a straightforward way to summarize a model's effectiveness, they convey little insight into the causes of misclassified events, giving data scientists few clues for improving their algorithms. In this paper we present BlueSky Xplorer, an interactive visualization system for analyzing, debugging, and comparing the output of multiple predictive models at different levels of granularity. We combine classification results on multi-sensor data with the usage context of each sensor and with ground-truth information (such as textual labels and videos), representing them as temporally aligned linear tracks. We then define an algebraic language over these tracks that enables users to quickly identify classification errors and to reason visually about classifier performance. We demonstrate the usefulness of our tool by applying it to a real-world example involving the development of models for assessing the symptoms of Parkinson's disease. In particular, we show how Xplorer was used to improve the performance of classification models and to uncover problems in the temporal alignment of the data.