Pooling acoustic and lexical features for the prediction of valence
In this paper, we present an analysis of different multimodal fusion approaches in the context of deep learning, focusing on pooling intermediate representations learned for the acoustic and lexical modalities. Traditional approaches to multimodal feature pooling include: concatenation, element-wise addition, and element-wise multiplication. We compare these traditional methods to outerproduct and compact bilinear pooling approaches, which consider more comprehensive interactions between features from the two modalities. We also study the influence of each modality on the overall performance of a multimodal system. Our experiments on the IEMOCAP dataset suggest that: (1) multimodal methods that combine acoustic and lexical features outperform their unimodal counterparts; (2) the lexical modality is better for predicting valence than the acoustic modality; (3) outer-product-based pooling strategies outperform other pooling strategies.