Pooling acoustic and lexical features for the prediction of valence

Zakaria Aldeneh; Soheil Khorram; Dimitrios Dimitriadis; Emily Mower Provost

doi:10.1145/3136755.3136760

ICMI 2017

Conference paper

03 Nov 2017

Abstract

In this paper, we present an analysis of different multimodal fusion approaches in the context of deep learning, focusing on pooling intermediate representations learned for the acoustic and lexical modalities. Traditional approaches to multimodal feature pooling include concatenation, element-wise addition, and element-wise multiplication. We compare these traditional methods to outer-product and compact bilinear pooling approaches, which consider more comprehensive interactions between features from the two modalities. We also study the influence of each modality on the overall performance of a multimodal system. Our experiments on the IEMOCAP dataset suggest that: (1) multimodal methods that combine acoustic and lexical features outperform their unimodal counterparts; (2) the lexical modality is better for predicting valence than the acoustic modality; (3) outer-product-based pooling strategies outperform other pooling strategies.
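The five pooling operations named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: all function names are hypothetical, the vectors stand in for learned acoustic and lexical representations, and the compact bilinear variant assumes the usual count-sketch formulation, in which the sketch of an outer product equals the circular convolution of the two individual sketches. The convolution is computed directly here for clarity; practical implementations typically use an FFT instead.

```python
import random


def concat(a, l):
    """Concatenation: output size is len(a) + len(l)."""
    return list(a) + list(l)


def add(a, l):
    """Element-wise addition: requires equal-length inputs."""
    return [x + y for x, y in zip(a, l)]


def multiply(a, l):
    """Element-wise multiplication: requires equal-length inputs."""
    return [x * y for x, y in zip(a, l)]


def outer_product(a, l):
    """Flattened outer product: every pairwise interaction, size len(a) * len(l)."""
    return [x * y for x in a for y in l]


def make_sketch_params(n, d, rng):
    """Random bucket index and random sign for each of n input dimensions (assumed setup)."""
    buckets = [rng.randrange(d) for _ in range(n)]
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    return buckets, signs


def count_sketch(v, buckets, signs, d):
    """Project v into d dimensions by adding signed entries into hashed buckets."""
    out = [0.0] * d
    for x, b, s in zip(v, buckets, signs):
        out[b] += s * x
    return out


def circular_convolution(p, q):
    """Direct O(d^2) circular convolution of two length-d vectors."""
    d = len(p)
    out = [0.0] * d
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[(i + j) % d] += pi * qj
    return out


def compact_bilinear(a, l, params_a, params_l, d):
    """Approximate outer-product pooling in d dimensions via count sketches."""
    sketch_a = count_sketch(a, params_a[0], params_a[1], d)
    sketch_l = count_sketch(l, params_l[0], params_l[1], d)
    return circular_convolution(sketch_a, sketch_l)


if __name__ == "__main__":
    rng = random.Random(0)
    acoustic = [1.0, 2.0, 3.0]  # toy stand-in for an acoustic representation
    lexical = [4.0, 5.0, 6.0]   # toy stand-in for a lexical representation
    d = 4                       # hypothetical output size for compact pooling
    params_a = make_sketch_params(len(acoustic), d, rng)
    params_l = make_sketch_params(len(lexical), d, rng)
    print(concat(acoustic, lexical))
    print(outer_product(acoustic, lexical))
    print(compact_bilinear(acoustic, lexical, params_a, params_l, d))
```

Concatenation and the element-wise operations keep the output size linear in the input sizes, whereas the full outer product grows with the product of the two sizes. Compact bilinear pooling fixes the output size at d while still mixing every acoustic-lexical pair, which is the trade-off behind the comparison described in the abstract.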