Multimodal Transformer Models for Structure Elucidation from Spectra
Abstract
The application of machine learning models in chemistry has made remarkable strides in recent years. From expediting DFT calculations and enhancing retrosynthesis to translating a synthesis procedure into robot-executable steps, the field has seen immense progress. To further streamline automation and feedback loops in chemical synthesis, similar advances will be needed in analytical chemistry. We have recently shown that chemical structures can be determined from spectral data using language models. In particular, we demonstrated for the first time that structure elucidation can be accomplished solely from IR spectra. Additionally, we showed that language models can accurately predict the chemical structure from processed NMR spectra consisting only of the peak position, integration, and multiplet type. However, both studies relied on converting spectral information, a vector, into a string-based representation. This conversion inevitably discards information that could be crucial for more informed model predictions. In this work, we introduce a multimodal transformer capable of interpreting both textual information and spectra (represented as vectors). We base our model on an encoder-decoder architecture and feed both textual and vector information jointly into the encoder. Taking inspiration from the Vision Transformer architecture, we subdivide spectra into patches and embed them using convolutional kernels. This allows us to leverage all of the information present in a spectrum and to make more accurate predictions.
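The patch embedding described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names `embed_patches` and `build_encoder_input`, the patch size of 75, the embedding width of 8, the spectrum length of 1800 points, and the random weights are all assumptions chosen for illustration. The sketch assumes non-overlapping patches, so the convolution uses a stride equal to the kernel size, and it omits positional encodings and the transformer itself.

```python
import random


def embed_patches(spectrum, kernel, bias, patch_size):
    """Strided 1D convolution (stride == patch_size) over a spectrum.

    kernel[c][k] is the weight of output channel c at patch offset k.
    Returns one embedding vector (one value per channel) for each
    non-overlapping patch. Assumption: points that do not fill a
    complete trailing patch are dropped.
    """
    num_patches = len(spectrum) // patch_size
    tokens = []
    for p in range(num_patches):
        patch = spectrum[p * patch_size:(p + 1) * patch_size]
        tokens.append([
            sum(w * x for w, x in zip(channel, patch)) + b
            for channel, b in zip(kernel, bias)
        ])
    return tokens


def build_encoder_input(text_tokens, spectrum_tokens):
    """Concatenate both modalities along the sequence axis."""
    return text_tokens + spectrum_tokens


# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
d_model, patch_size = 8, 75
spectrum = [rng.random() for _ in range(1800)]
kernel = [[rng.gauss(0.0, 0.02) for _ in range(patch_size)]
          for _ in range(d_model)]
bias = [0.0] * d_model

# Stand-in for already-embedded text tokens (12 tokens of width d_model).
text_tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)]
               for _ in range(12)]

spectrum_tokens = embed_patches(spectrum, kernel, bias, patch_size)
encoder_input = build_encoder_input(text_tokens, spectrum_tokens)
print(len(encoder_input), len(encoder_input[0]))  # 36 8
```

With these assumed sizes, the 1800-point spectrum yields 24 patch tokens, which join the 12 text tokens to form a 36-token encoder input. Because each patch is projected directly from its raw intensities, no conversion of the spectrum to a string is needed.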