Growing strings in a chemical reaction space for searching retrosynthesis pathways
Machine learning algorithms have demonstrated remarkable accuracy in predicting the outcomes of chemical reactions, often outperforming human experts. At the same time, a high level of precision has been achieved in single-step retrosynthesis prediction. Designing a complete synthesis pathway to a given product, however, remains a challenging task that runs up against the limits of many currently available ML-driven algorithms. Like a game of chess, retrosynthesis route prediction entails assembling a series of steps that lead from existing substances to a given product, with the goal of optimizing synthesis efficiency by exploiting specific strategic rules such as protection, deprotection, and functional group interconversion (FGI). Because current machine learning models are trained on single reaction steps, they lack knowledge of these strategic rules. Here, we recast the retrosynthesis problem as a string optimization problem, capitalizing on the homology between the chemical reaction space and a multidimensional geometrical space. If chemical reactions are represented as multidimensional vectors (fingerprints), then a synthesis in this space is a string of three or more connected fingerprints. An extensive corpus of chemical syntheses, comprising approximately 1.2M examples, was extracted and added as strings to the chemical reaction space. We use the Euclidean metric to minimize the distance between the trajectory of the growing retrosynthesis string and the existing strings. In this way, we aim to assemble steps that grow, in the chemical reaction space, along paths similar to existing retrosyntheses, thereby inheriting the strategic guidelines compiled by domain experts. We integrated this approach into the RXN platform (https://rxn.res.ibm.com/) and present its application to complex syntheses as well as its ability to produce better synthetic strategies than current methodologies.
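The string-growing idea can be illustrated with a minimal sketch. It is not the implementation used in the RXN platform. It assumes that reaction fingerprints are plain lists of floats, that a string is a list of fingerprints, and that the distance between a growing string and a reference string is the mean step-wise Euclidean distance over their aligned steps. The names `string_distance`, `nearest_reference_distance`, and `pick_next_step`, the alignment rule, and the toy two-dimensional data are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two fingerprints of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def string_distance(growing, reference):
    """Mean step-wise Euclidean distance between two strings.

    Assumption: steps are aligned by position, and a reference shorter
    than the growing string cannot be followed, so it scores infinity.
    """
    if len(reference) < len(growing):
        return math.inf
    total = sum(euclidean(g, r) for g, r in zip(growing, reference))
    return total / len(growing)


def nearest_reference_distance(growing, corpus):
    """Distance from the growing string to its closest reference string."""
    return min(string_distance(growing, ref) for ref in corpus)


def pick_next_step(growing, candidates, corpus):
    """Choose the candidate fingerprint that keeps the extended string
    closest to some existing string in the corpus."""
    return min(
        candidates,
        key=lambda c: nearest_reference_distance(growing + [c], corpus),
    )


# Toy data (hypothetical): two reference strings in a 2-D reaction space.
corpus = [
    [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]],
]
growing = [[0.1, 0.0], [1.1, 0.1]]           # two steps already chosen
candidates = [[2.0, 0.2], [1.0, 1.5], [-1.0, -1.0]]

best = pick_next_step(growing, candidates, corpus)
print(best)
```

In this toy case the growing string already tracks the first reference string, so the candidate that continues along it is selected. A practical system would presumably combine such a distance score with the single-step model's own confidence and would use high-dimensional learned fingerprints rather than two-dimensional points.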