Molecular transformer-aided biocatalysed synthesis planning
Enzyme catalysts are an integral part of green chemistry strategies towards more sustainable and resource-efficient chemical synthesis. However, the retrosynthesis of given targets with biocatalysed reactions remains a significant challenge: substrate specificity, the potential to act on unreported substrates, and specific stereo- and regioselectivity are domain-specific knowledge factors that hinder the adoption of biocatalysis in daily laboratory work. Here, we use the molecular transformer architecture to capture the latent knowledge about enzymatic activity from a large data set of publicly available enzymatic data, extending forward reaction and retrosynthetic pathway prediction to the domain of biocatalysis. We introduce a class token based on the Enzyme Commission (EC) classification scheme that allows the model to capture catalysis patterns among different enzymes belonging to the same hierarchical families. The forward prediction model achieves a top-5 accuracy of 62.7%, while the single-step retrosynthetic model shows a top-1 round-trip accuracy of 39.6%. The enzymatic data and the trained models are available through the RXN for Chemistry network (https://rxn.res.ibm.com and https://github.com/rxn4chemistry).
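To illustrate the idea of a hierarchical class token, the following minimal sketch truncates a four-field EC number to a chosen level and appends it to the substrate SMILES as a single token. The token syntax `[EC:...]`, the space separator, and the function names `ec_class_token` and `tag_source` are assumptions made for illustration only; they are not necessarily the format used by the trained models.

```python
from typing import List


def ec_class_token(ec_number: str, level: int = 3) -> str:
    """Return a single class token built from the first `level` EC fields.

    Hypothetical token format, chosen for illustration only.
    """
    if not 1 <= level <= 4:
        raise ValueError("level must be between 1 and 4")
    fields = ec_number.strip().split(".")
    if len(fields) != 4 or any(f == "" for f in fields):
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    # Keeping fewer fields groups enzymes into broader hierarchical families.
    return "[EC:" + ".".join(fields[:level]) + "]"


def tag_source(substrates: List[str], ec_number: str, level: int = 3) -> str:
    """Join substrate SMILES with '.' and append the EC class token (assumed layout)."""
    return ".".join(substrates) + " " + ec_class_token(ec_number, level)


if __name__ == "__main__":
    # Ethanol as substrate, alcohol dehydrogenase (EC 1.1.1.1) as enzyme.
    for level in (1, 2, 3, 4):
        print(tag_source(["CCO"], "1.1.1.1", level))
```

Under this assumed scheme, two enzymes that share the first three EC fields receive the same token at level 3, which is one way a model could share learned patterns within a family while still distinguishing broader classes.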