ACS Fall 2023

Automatic structure elucidation from NMR spectra


Machine learning models in chemistry have made impressive progress in recent years. From enhancing retrosynthesis and folding proteins to predicting new drug candidates, the field has seen immense advances.[1–3] Although the application of machine learning in analytical chemistry has also received increased attention, machine-learning-based methods have so far not been adopted into everyday use by bench chemists. NMR spectroscopy is among the most powerful analytical tools available to chemists. It can be used to characterise molecular structure, determine complicated stereochemistry, and quantify mixtures. Although chemists regularly use NMR, and numerous programs exist to help process spectra, fully automated structure elucidation has yet to become practical. Machine learning could be a valuable tool for enabling automatic structure elucidation. Previous attempts to use machine learning to characterise molecules from their spectra have been limited; one successful example determines the structure of compounds with up to 10 heavy atoms. However, that model requires high-resolution 1H and 13C NMR data and was trained on simulated spectra. In this work we introduce a novel machine learning approach that predicts the molecular structure directly from the 1H NMR spectrum. To achieve this, we developed a transformer model trained on 1H NMR spectra that predicts molecular structures as SMILES strings. We obtained the 1H NMR spectra from the experimental sections of the patent reactions in NextMove’s Pistachio dataset.[4] In addition to the 1H NMR spectrum in text form, the model takes the chemical formula as input. In contrast to previous work, we include molecules with a heavy atom count from 10 to 35. We trained the model on approximately 750,000 examples. Our model achieves a top-1 accuracy of 21.2% and a top-10 accuracy of 40.7%. The model’s accuracy could be further improved by including the 13C NMR spectrum and fine-tuning on more detailed NMR data.
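
For readers unfamiliar with the top-1 and top-10 figures quoted above, the following is a minimal sketch of how a top-k accuracy over ranked SMILES predictions could be computed. It is not the evaluation code used in this work: the function name `top_k_accuracy` and the toy data are assumptions made for illustration, and plain string equality is used under the assumption that predictions and targets have already been converted to one canonical SMILES form (a canonicalisation step, for example with a cheminformatics toolkit, is omitted here).

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of examples whose target is among the first k predictions.

    Assumes (hypothetically) that all SMILES strings are already canonical,
    so that exact string comparison is a valid identity check.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("need one ranked prediction list per target")
    if not targets:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy data, invented for illustration only (not results from this work).
predictions = [
    ["CCO", "COC", "CC=O"],      # target found at rank 1
    ["CC(C)O", "CCCO", "CCOC"],  # target found at rank 2
    ["c1ccccc1", "C1CCCCC1"],    # target not among the candidates
]
targets = ["CCO", "CCCO", "Cc1ccccc1"]

top1 = top_k_accuracy(predictions, targets, k=1)
top3 = top_k_accuracy(predictions, targets, k=3)
```

With ranked candidates such as those produced by a sequence decoder, the same function would be called with `k=1` and `k=10` to obtain the two kinds of figure reported above.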