This paper presents our methods to the Audio-Video Based Emotion Recognition subtask in the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task aims to predict one of the seven basic emotions for short video segments. We extract different features from audio and facial expression modalities. We also explore the temporal LSTM model with the input of frame facial features, which improves the performance of the non-Temporal model. The fusion of different modality features and the temporal model lead us to achieve a 58.5% accuracy on the testing set, which shows the effectiveness of our methods.