Automatic structure elucidation from IR spectra
Abstract
The application of machine learning models in chemistry has made remarkable strides in recent years. From enhancing retrosynthesis and expediting DFT calculations to predicting new drug candidates, the field has seen immense progress. Although interest in analytical chemistry has grown, machine-learning-based methods have so far not been adopted into everyday use by bench chemists. Of the analytical instruments commonly available to the chemist, infrared (IR) spectroscopy has receded in importance with the advent of more powerful structure elucidation tools such as nuclear magnetic resonance (NMR) spectroscopy and liquid chromatography–mass spectrometry (LC/MS). While chemists routinely identify functional groups from IR spectra, obtaining further information from them is challenging. Previous work on applying machine learning to IR spectroscopy has focused on identifying functional groups, and very few attempts at predicting the molecular structure directly have been published. In this work we introduce a novel machine learning approach that predicts the molecular structure directly from the IR fingerprint region. To achieve this, we developed a transformer model trained on IR spectra (400–2000 cm⁻¹) that predicts molecular structures as SMILES strings. In addition, we assessed the impact of appending the chemical formula to the input string and found that it enhances the accuracy of the model. Given the lack of large, high-quality experimental IR spectra databases, we generated a training set of 650,000 simulated IR spectra using molecular dynamics. Our approach achieved a top-1 accuracy of 29.7% and a top-10 accuracy of 62.8% on a test set sampled from PubChem, comprising molecules with a heavy atom count ranging from 6 to 13. The resulting model can serve as a pre-trained starting point for fine-tuning on smaller experimental datasets.
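To make the reported top-1 and top-10 figures concrete, the following is a minimal sketch of how a top-k accuracy over ranked SMILES predictions could be computed. It is an illustration, not the evaluation code used in this work: the function name `top_k_accuracy`, the toy molecules, and the use of plain string equality are all assumptions. In particular, it assumes that targets and predictions have already been converted to the same canonical SMILES form, so that identical molecules yield identical strings.

```python
from typing import Sequence


def top_k_accuracy(
    targets: Sequence[str],
    ranked_predictions: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Fraction of targets that appear among the first k ranked predictions.

    Assumption: all SMILES strings are already canonicalized in the same
    way, so exact string comparison is a valid test of molecular identity.
    """
    if len(targets) != len(ranked_predictions):
        raise ValueError("targets and ranked_predictions must have equal length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not targets:
        return 0.0
    hits = sum(
        1
        for target, predictions in zip(targets, ranked_predictions)
        if target in predictions[:k]
    )
    return hits / len(targets)


# Hypothetical toy data: three target molecules, two ranked guesses each.
targets = ["CCO", "CC(=O)O", "c1ccccc1"]
ranked_predictions = [
    ["CCO", "COC"],            # correct at rank 1
    ["COC=O", "CC(=O)O"],      # correct at rank 2
    ["C1CCCCC1", "C1=CCCCC1"], # never correct
]

top1 = top_k_accuracy(targets, ranked_predictions, k=1)
top2 = top_k_accuracy(targets, ranked_predictions, k=2)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```

In this toy example one of three targets is recovered at rank 1 and two of three within the first two ranks. A top-10 accuracy is computed the same way with `k=10` over ten ranked candidates per spectrum.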