ACS Fall 2021

POCSTagger: Identifying part-of-chemical-speech with transformers


In the quest to build better automatic retrosynthetic tools, the ability to interface artificial intelligence models with more traditional computational chemistry software becomes of paramount importance. Language-based models for retrosynthesis, like the ones in IBM RXN for Chemistry, output sequences of retrosynthetic steps represented as reaction SMILES. The construction of reaction networks using atomistic modelling schemes requires knowledge of the role of the individual molecules in a reaction equation: the solvent needs to be treated explicitly or implicitly, and catalysts often undergo specific preprocessing/transformations in computational chemistry tools. Manual labeling is laborious and cannot be done for all predicted routes; therefore, an automated process is required. In this context, we developed a part-of-speech tagging AI model based on the BERT transformer architecture used in Natural Language Processing. After pretraining on reaction patent data and subsequently adding a classification layer to the network, our models accurately predict the roles of the different components involved in chemical reactions. This work brings us one step closer to automated validation of synthesis routes.
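As an illustration of the tagging task, a reaction SMILES already carries coarse positional roles ("reactants>agents>products", with components separated by "."), which the model refines into finer-grained labels such as solvent or catalyst. The following minimal sketch (the helper name `split_reaction_smiles` is ours, not from the work described above) shows only the positional split, not the learned classification:

```python
def split_reaction_smiles(rxn: str) -> dict:
    """Split a reaction SMILES into the roles implied by its position.

    Reaction SMILES have the form "reactants>agents>products";
    within each part, "." separates individual components.
    """
    reactants, agents, products = rxn.split(">")
    part = lambda s: s.split(".") if s else []
    return {
        "reactants": part(reactants),
        "agents": part(agents),
        "products": part(products),
    }

# Fischer esterification: acetic acid + ethanol, sulfuric acid as agent
roles = split_reaction_smiles("CC(=O)O.CCO>OS(=O)(=O)O>CCOC(C)=O.O")
print(roles["agents"])  # → ['OS(=O)(=O)O']
```

A learned tagger is still needed because the middle "agents" slot conflates solvents, catalysts, and other reagents, which the BERT-based model distinguishes.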