In many real-world big data applications, the data distribution is not homogeneous over the entire dataset but instead varies across groups or clusters of data samples. Although a model's predictive performance remains vital, there is also a need to learn succinct sets of features that evolve across clusters and capture smooth variations in the data distribution. These small sets of features not only lead to high prediction accuracy but also reveal the important underlying processes. We investigate this challenging problem by developing a novel multi-task learning paradigm that trains multiple support vector machine (SVM) classifiers over a set of related data clusters and directly imposes smoothness constraints on adjacent classifiers. We show that such patterns can be effectively learned in the dual form of the classical SVM, and further show that a parsimonious solution can be achieved in the primal form. Although the solution can be optimized effectively via gradient descent, the technical development is not straightforward and requires a relaxation of the SVM loss function. We demonstrate the performance of our algorithm in two practical application domains: team performance prediction and road traffic prediction. Empirical results show that our model not only achieves competitive prediction accuracy but also discovers patterns that capture, and give intuition about, the variation in the data distribution across multiple data clusters.
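To make the idea of smoothness-coupled classifiers concrete, the following is a minimal illustrative sketch and not the formulation developed in this paper. It assumes a chain ordering of clusters, uses the squared hinge loss as one possible smooth relaxation of the SVM loss, penalizes squared differences between adjacent weight vectors, and encourages sparsity with a soft-thresholding (proximal) step. The function name `fit_smooth_svms` and the parameters `lam_smooth`, `lam_sparse`, `lr`, and `n_iter` are hypothetical names introduced only for this example.

```python
import numpy as np

def fit_smooth_svms(Xs, ys, lam_smooth=1.0, lam_sparse=0.01, lr=0.01, n_iter=500):
    """Fit one linear classifier per cluster, coupled along a chain (illustrative only).

    Xs: list of (n_k, d) feature arrays, one per cluster, in chain order.
    ys: list of (n_k,) label arrays with values in {-1, +1}.
    Returns W of shape (K, d), one weight vector per cluster.
    """
    K, d = len(Xs), Xs[0].shape[1]
    W = np.zeros((K, d))
    for _ in range(n_iter):
        G = np.zeros_like(W)
        for k in range(K):
            # Squared hinge loss (assumed relaxation): mean(max(0, 1 - y * x.w)^2)
            active = np.maximum(1.0 - ys[k] * (Xs[k] @ W[k]), 0.0)
            G[k] = -2.0 * (Xs[k].T @ (active * ys[k])) / len(ys[k])
        # Smoothness penalty: lam_smooth * sum_k ||W[k+1] - W[k]||^2
        D = W[1:] - W[:-1]
        G[:-1] -= 2.0 * lam_smooth * D
        G[1:] += 2.0 * lam_smooth * D
        # Gradient step on the smooth part of the objective
        W = W - lr * G
        # Proximal step for the L1 term: soft-thresholding toward sparse weights
        W = np.sign(W) * np.maximum(np.abs(W) - lr * lam_sparse, 0.0)
    return W
```

Under these assumptions, increasing `lam_smooth` pulls the weight vectors of neighboring clusters toward each other, while `lam_sparse` controls how many features remain active. With this fixed step size, `lam_smooth` should stay moderate (roughly below `1 / (4 * lr)`) for the iteration to remain stable.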