Recognition of natural emotions from human faces is an interesting topic with a wide range of potential applications, such as human-computer interaction, automated tutoring systems, image and video retrieval, smart environments, and driver warning systems. Traditionally, facial emotion recognition systems have been evaluated on laboratory controlled data, which is not representative of the environment faced in real-world applications. To robustly recognize the facial emotions in real-world natural situations, this paper proposes an approach called extreme sparse learning, which has the ability to jointly learn a dictionary (set of basis) and a nonlinear classification model. The proposed approach combines the discriminative power of extreme learning machine with the reconstruction property of sparse representation to enable accurate classification when presented with noisy signals and imperfect data recorded in natural settings. In addition, this paper presents a new local spatio-temporal descriptor that is distinctive and pose-invariant. The proposed framework is able to achieve the state-of-the-art recognition accuracy on both acted and spontaneous facial emotion databases.