This paper focuses on generating multimodal explanations for information fusion tasks performed on multimodal data. We propose that separating modal components in saliency-map explanations gives users a better understanding of how convolutional neural networks process multimodal data. We adapt established state-of-the-art explainability techniques to mid-level fusion networks in order to better understand (a) which modality of the input contributes most to a model's decision and (b) which parts of the input data are most relevant to that decision. Our method separates temporal from non-temporal information, allowing a user to focus their attention on salient elements of the scene that are changing across modalities. We evaluate the approach experimentally on an activity recognition task using video and audio data. Because explanations must be tailored to the type of user in a User Fusion context, we focus on meeting the explanation requirements of system creators and operators respectively.
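To illustrate the idea of attributing a fusion model's decision both per modality and per input element, the following is a minimal sketch using finite-difference sensitivity on a toy mid-level fusion model. Everything here is hypothetical (the encoders, weights, and inputs are invented for illustration); the paper's actual method adapts established gradient-based saliency techniques to convolutional fusion networks.

```python
# Toy sketch (not the paper's method): per-modality saliency for a
# hypothetical mid-level fusion model, estimated via finite differences.

def fuse_and_score(video_feats, audio_feats):
    """Mid-level fusion: encode each modality, concatenate, then score."""
    v = [0.6 * x for x in video_feats]   # hypothetical video encoder
    a = [0.3 * x for x in audio_feats]   # hypothetical audio encoder
    fused = v + a                        # concatenation = mid-level fusion
    return sum(fused)                    # scalar decision score

def saliency(video_feats, audio_feats, eps=1e-4):
    """Absolute sensitivity of the score to each input element."""
    base = fuse_and_score(video_feats, audio_feats)
    sal_v, sal_a = [], []
    for i in range(len(video_feats)):
        perturbed = list(video_feats)
        perturbed[i] += eps
        sal_v.append(abs(fuse_and_score(perturbed, audio_feats) - base) / eps)
    for i in range(len(audio_feats)):
        perturbed = list(audio_feats)
        perturbed[i] += eps
        sal_a.append(abs(fuse_and_score(video_feats, perturbed) - base) / eps)
    return sal_v, sal_a

video = [1.0, 2.0, 0.5]   # invented video features
audio = [0.2, 0.8]        # invented audio features
sal_v, sal_a = saliency(video, audio)
print(sum(sal_v), sum(sal_a))  # aggregate saliency per modality
```

Summing the element-wise saliencies within each modality yields a per-modality contribution score, corresponding to question (a) above, while the individual values indicate which parts of the input matter most, corresponding to question (b).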