Neural networks can leverage self-supervision to learn integrated representations across multiple data modalities. This makes them suitable for uncovering complex relationships between vastly different data types, lowering the dependency on labor-intensive feature engineering methods. Leveraging deep representation learning, we propose a generic, robust and systematic model that combines multiple data modalities in a fashion invariant to both the ordering and the number of modes, two properties that are fundamental for coping with changes in data type, content and availability. To this end, we treat each multi-modal data sample as a set and utilise autoencoders to learn a fixed-size, permutation-invariant representation that can be used in any downstream decision-making process. We build upon previous work that demonstrates the feasibility of presenting a set as an input to autoencoders through content-based attention mechanisms. However, since the model's inputs and outputs are permutation invariant, we develop an end-to-end architecture that approximates the solution of a linear sum assignment problem, i.e., a minimum-cost bijective mapping, to ensure a match between the elements of the input set and those of the reconstructed set. For set dimensions up to 128, the network matches the two sets with near-perfect accuracy. Combining the content-based attention mechanism for set processing with this matching network allows us to construct a Fully Differentiable Set Autoencoder. We demonstrate the model's capability to learn a combined representation while preserving individual mode characteristics, focusing on the task of reconstructing multi-omic cancer data.
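To make the matching step concrete, the sketch below illustrates the linear sum assignment problem the abstract refers to, solved here exactly with the Hungarian algorithm via SciPy rather than with the paper's learned, differentiable approximation. The function name `match_sets` and the squared-Euclidean cost are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_sets(inputs, recons):
    # Pairwise squared-Euclidean cost between every input element and
    # every reconstructed element (illustrative choice of cost).
    cost = ((inputs[:, None, :] - recons[None, :, :]) ** 2).sum(-1)
    # Minimum-cost bijective mapping between the two sets.
    rows, cols = linear_sum_assignment(cost)
    return cols, cost[rows, cols].sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                    # input set of 5 elements
perm = rng.permutation(5)
y = x[perm] + 0.01 * rng.normal(size=(5, 3))   # noisy, permuted "reconstruction"

assignment, total_cost = match_sets(x, y)
# assignment[i] indexes the reconstructed element matched to input x[i],
# so it recovers the inverse of the applied permutation.
```

In the paper's setting this exact combinatorial solver is replaced by a network that approximates the assignment end-to-end, keeping the whole autoencoder differentiable.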