Alain Vaucher, Matteo Manica, et al.
ACS Spring 2023
Structure elucidation is integral to the day-to-day operation of any organic chemistry laboratory, allowing the structure and composition of unknown substances to be determined. Most commonly, this is achieved with spectroscopic techniques, chief among them Nuclear Magnetic Resonance (NMR) spectroscopy, Infrared (IR) spectroscopy, and Mass Spectrometry (MS). While the acquisition of spectra has been largely automated, their analysis is not straightforward, which makes interpreting spectra, particularly in large quantities, time-consuming and tedious.
We have previously shown that Transformer models are capable of predicting the exact chemical structure from NMR and IR spectra. However, these models could interpret only one modality at a time, i.e., either the NMR or the IR spectrum. This is fundamentally different from how a chemist determines the structure of an unknown compound, which involves extracting information from several spectroscopic modalities. Here we present a model that emulates this approach. Our model can predict the correct structure not only from a combination of different spectra but also from each modality on its own. We pretrained the model on a large set of simulated spectra before fine-tuning it on a smaller dataset of experimental spectra, achieving a Top-1 accuracy of up to 96%. In addition, we rigorously compare the performance of our model to that of human chemists and provide a set of ablation studies on the synergistic effects of the spectroscopic modalities.
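For readers unfamiliar with the metric, Top-k accuracy can be sketched as below. This is a minimal illustration, not the evaluation code used in this work: it assumes each prediction is a ranked list of SMILES strings that have already been canonicalized, so that exact string comparison is meaningful. The function name `top_k_accuracy` and the toy data are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of samples whose target appears among the first k predictions.

    Assumes all strings are canonical SMILES (an assumption of this sketch),
    so a correct structure is detected by exact string match.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not targets:
        return 0.0
    hits = sum(
        1
        for preds, target in zip(ranked_predictions, targets)
        if target in preds[:k]
    )
    return hits / len(targets)


# Toy example (hypothetical data): three unknowns, two ranked guesses each.
example_predictions = [
    ["CCO", "CO"],             # correct at rank 1
    ["CC(=O)O", "CCO"],        # correct only at rank 2
    ["c1ccccc1", "C1CCCCC1"],  # correct at rank 1
]
example_targets = ["CCO", "CCO", "c1ccccc1"]

top1 = top_k_accuracy(example_predictions, example_targets, k=1)
top2 = top_k_accuracy(example_predictions, example_targets, k=2)
```

With this toy data, two of the three targets are matched at rank 1, and all three are matched within the first two ranks. Top-1 is the strictest setting because only the model's highest-ranked structure counts.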