Context-sensitive prediction of facial expressivity using multimodal hierarchical Bayesian neural networks
Objective automated affect analysis systems can be applied to quantify the progression of symptoms in neurodegenerative diseases such as Parkinson's disease (PD). PD hampers the ability of patients to emote by decreasing the mobility of their facial musculature, a phenomenon known as "facial masking." In this work, we focus on building a system that can predict an accurate score of active facial expressivity in people with PD using features extracted from both video and audio. An ideal automated system should mimic the ability of human experts to take contextual information into account when making these predictions. For example, patients exhibit different emotions, with varying intensities, when speaking about positive versus negative experiences. We utilize a hierarchical Bayesian neural network framework to enable the learning of model parameters that subtly adapt to pre-defined notions of context, such as the gender of the patient or the valence of the expressed sentiment. We evaluate our formulation on a dataset of 772 20-second video clips of patients with PD and demonstrate that training a context-specific hierarchical Bayesian framework improves model performance, in both multiclass classification and regression settings, over baseline models trained on all data pooled together.
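The core idea of the hierarchical formulation, sharing statistical strength across contexts while still letting each context keep its own parameters, can be illustrated with a minimal partial-pooling sketch. This is not the paper's model (the paper applies the idea to neural network weights); it is a hypothetical toy example in which a scalar expressivity score is estimated per context (e.g., positive vs. negative valence), with a shared Gaussian prior shrinking data-poor contexts toward the population mean. The variances and sample data below are assumptions for illustration only.

```python
import numpy as np

# Illustrative sketch (NOT the paper's model): hierarchical partial pooling
# of a scalar expressivity score across pre-defined contexts. Each context
# keeps its own parameter, but a shared global prior shrinks estimates for
# sparsely observed contexts toward the pooled mean -- the same principle a
# hierarchical Bayesian neural network applies to context-specific weights.

rng = np.random.default_rng(0)

sigma2 = 1.0  # assumed within-context observation variance
tau2 = 0.5    # assumed between-context (prior) variance

# Hypothetical per-context observations; the negative-valence context is
# deliberately data-poor so the prior's shrinkage effect is visible.
contexts = {
    "positive_valence": rng.normal(3.0, 1.0, size=40),
    "negative_valence": rng.normal(1.5, 1.0, size=5),
}

mu_global = np.mean(np.concatenate(list(contexts.values())))


def posterior_mean(y, mu, sigma2, tau2):
    """Posterior mean of a Gaussian context parameter under a Gaussian prior.

    Standard conjugate update: a precision-weighted average of the context's
    sample mean and the global prior mean.
    """
    n = len(y)
    precision = n / sigma2 + 1.0 / tau2
    return (n / sigma2 * y.mean() + mu / tau2) / precision


for name, y in contexts.items():
    est = posterior_mean(y, mu_global, sigma2, tau2)
    print(f"{name}: sample mean {y.mean():.3f} -> pooled estimate {est:.3f}")
```

Each context's estimate lands between its own sample mean and the global mean, with more shrinkage when the context has fewer observations; in the full model, the same partial pooling lets context-specific networks remain close to a shared backbone when context-specific data are scarce.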