ACS Central Science

Unbiasing Retrosynthesis Language Models with Disconnection Prompts

Data-driven approaches to retrosynthesis have thus far been limited in user interaction, in the diversity of their predictions, and in the recommendation of unintuitive disconnection strategies. Herein, we extend the notion of prompt-based inference from natural language processing to chemical language modelling. We show that a prompt describing the disconnection site in a molecule steers the model to propose a wider set of precursors, overcoming training-data biases in retrosynthetic recommendations and achieving a 39% performance improvement over the baseline. For the first time, a disconnection prompt gives chemists greater control over the disconnection predictions, resulting in more diverse and creative recommendations. In addition, in lieu of a human-in-the-loop strategy, we propose a scheme for automatic identification of disconnection sites, followed by prediction of reactant sets, achieving a 100% improvement in class diversity compared to the baseline. The approach is effective in mitigating prediction biases that derive from the training data. In turn, this provides a larger variety of usable building blocks, which improves the end-user digital experience. We demonstrate its application to chemistry domains ranging from traditional to enzymatic reactions, in which substrate specificity is key.
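To illustrate the idea of a disconnection prompt, the sketch below marks the atoms at an intended disconnection site directly in the product SMILES, producing a tagged string that could be fed to a retrosynthesis language model in place of the plain SMILES. This is a minimal, hypothetical sketch: the tokenization, the atom-map tag format, and the `tag_disconnection` helper are illustrative assumptions, not the paper's exact implementation.

```python
def tag_disconnection(smiles_tokens, site_indices):
    """Mark the atoms at `site_indices` (positions in the token list)
    with an atom-map number to flag the desired disconnection site.

    Hypothetical encoding: a bare organic-subset atom token such as
    'C' becomes '[C:1]'; all other tokens pass through unchanged.
    """
    tagged = []
    for i, tok in enumerate(smiles_tokens):
        if i in site_indices:
            tagged.append(f"[{tok}:1]")  # e.g. 'N' -> '[N:1]'
        else:
            tagged.append(tok)
    return "".join(tagged)

# Example: requesting the amide C(=O)-N disconnection in
# N-methylacetamide, SMILES CC(=O)NC, by tagging the carbonyl
# carbon (token index 1) and the nitrogen (token index 6).
tokens = ["C", "C", "(", "=", "O", ")", "N", "C"]
prompt = tag_disconnection(tokens, {1, 6})
print(prompt)  # C[C:1](=O)[N:1]C
```

The tagged string remains a valid SMILES, so the same sequence-to-sequence architecture can consume it; the tags simply bias decoding toward precursor sets that break the marked bond.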