Neural networks can leverage self-supervision to learn integrated representations across multiple data modalities. This makes them well suited to uncovering complex relationships between vastly different data types, reducing the dependency on labor-intensive feature engineering. Leveraging deep representation learning, we propose a generic, robust and systematic model that combines multiple data modalities in a manner that is invariant both to the permutation and to the number of modes, two properties that are fundamental for coping with changes in data type content and availability. To this end, we treat each multi-modal data sample as a set and utilise autoencoders to learn a fixed-size, permutation-invariant representation that can be used in any decision-making process. We build upon previous work that demonstrates the feasibility of presenting a set as an input to autoencoders through content-based attention mechanisms. However, because model inputs and outputs are permutation invariant, the elements of the output set are not guaranteed to appear in the same order as those of the input set. We therefore develop an end-to-end architecture that approximates the solution of a linear sum assignment problem, i.e., a minimum-cost bijective mapping problem, to ensure a match between the elements of the input and the output set for effective loss calculation. We demonstrate the model's capability to learn a combined representation while preserving individual mode characteristics, focusing on the task of reconstructing multi-omic cancer data. The code is publicly available on GitHub (https://github.com/PaccMann/fdsa).
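To make the matching step concrete, the following minimal sketch shows how a minimum-cost bijective mapping between an input set and a reconstructed set can be used to compute a reconstruction loss. It is an illustration only: it solves the linear sum assignment problem exactly with SciPy's `linear_sum_assignment`, whereas the proposed architecture approximates this solution end-to-end. The function name `matched_reconstruction_loss`, the squared Euclidean cost and the toy data are assumptions made for this example and are not taken from the released code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def matched_reconstruction_loss(inputs, outputs):
    """Hypothetical helper: match set elements, then average the matched costs.

    inputs, outputs: arrays of shape (n_elements, n_features).
    Returns the mean matched cost and, for each input index i,
    the index of the output element assigned to it.
    """
    # cost[i, j] = squared Euclidean distance between input i and output j
    # (the choice of cost is an assumption for this sketch).
    diff = inputs[:, None, :] - outputs[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    # Exact minimum-cost one-to-one assignment between the two sets.
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean(), cols


# Toy example: the "reconstruction" is the input set in a shuffled order.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
perm = np.array([2, 0, 3, 1])
y = x[perm]  # y[j] equals x[perm[j]]

loss, assignment = matched_reconstruction_loss(x, y)
```

In this toy case the matching recovers the shuffle, so the loss vanishes even though a naive element-by-element comparison of `x` and `y` would report a large error. This is the reason a matching step is needed before the loss is calculated on unordered sets.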