In this paper, we study the problem of understanding human sentiment from large-scale collections of Internet images, using both image features and contextual social network information (such as friend comments and user descriptions). Despite great strides in analyzing user sentiment from text, the sentiment conveyed by image content has largely been ignored. We therefore extend the significant advances in text-based sentiment prediction to the higher-level challenge of predicting the sentiment underlying images. We show that neither visual features nor textual features are sufficient by themselves for accurate sentiment labeling, and we provide a way of using both. We leverage the low-level visual features and mid-level attributes of an image and formulate the sentiment prediction problem in a non-negative matrix tri-factorization framework, which has the flexibility to incorporate multiple modalities of information and the capability to learn from heterogeneous features jointly. We develop an optimization algorithm that finds a locally optimal solution under the proposed framework. In experiments on two large-scale datasets, the proposed method improves significantly over existing state-of-the-art methods.
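
To make the tri-factorization idea concrete, the following is a minimal sketch of a generic, unregularized non-negative matrix tri-factorization, X ≈ U S Vᵀ with all factors non-negative, fitted by standard multiplicative updates. It is an illustration under stated assumptions, not the method proposed here: the function name `nmtf`, the factor names `U`, `S`, `V`, the ranks `k1` and `k2`, and the plain Frobenius objective are all hypothetical choices, and the sketch omits the textual and social-context terms that the proposed framework incorporates.

```python
import numpy as np


def nmtf(X, k1, k2, n_iter=200, seed=0, eps=1e-9):
    """Generic NMTF sketch: minimize ||X - U S V^T||_F^2 with U, S, V >= 0.

    X is a non-negative (n x m) matrix, e.g. images by visual features.
    k1 and k2 are the (assumed) numbers of row and column clusters.
    Returns the factors and the reconstruction error after each step.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random non-negative initialization (an assumption of this sketch).
    U = rng.random((n, k1))
    S = rng.random((k1, k2))
    V = rng.random((m, k2))

    errs = [np.linalg.norm(X - U @ S @ V.T)]
    for _ in range(n_iter):
        # Each update rescales a factor by the ratio of the negative to the
        # positive part of its gradient, which keeps the entries non-negative.
        U *= (X @ V @ S.T) / (U @ S @ V.T @ V @ S.T + eps)
        V *= (X.T @ U @ S) / (V @ S.T @ U.T @ U @ S + eps)
        S *= (U.T @ X @ V) / (U.T @ U @ S @ V.T @ V + eps)
        errs.append(np.linalg.norm(X - U @ S @ V.T))
    return U, S, V, errs


if __name__ == "__main__":
    # Toy data standing in for an image-by-feature matrix.
    X = np.random.default_rng(42).random((30, 20))
    U, S, V, errs = nmtf(X, k1=3, k2=3, n_iter=100)
    print(f"initial error {errs[0]:.3f}, final error {errs[-1]:.3f}")
```

Because the objective is non-convex in the three factors jointly, updates of this kind can only be expected to reach a local optimum, and the result depends on the initialization.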