npj Computational Materials

Growing strings in a chemical reaction space for searching retrosynthesis pathways


Machine learning algorithms have shown great accuracy in predicting chemical reaction outcomes and retrosyntheses. However, designing synthesis pathways remains challenging for existing machine learning models, which are trained for single-step prediction. In this manuscript, we propose to recast the retrosynthesis problem as a string optimization problem in a data-driven fingerprint space, leveraging the similarity between chemical reactions and embedding vectors. Based on this premise, a complex multi-step synthesis can be conceptualized as a sequence that links multidimensional vectors (fingerprints), each representing an individual chemical reaction step. We extracted an extensive corpus of chemical syntheses from patents and converted them into multidimensional strings. While optimizing the retrosynthetic path, we use the Euclidean metric to minimize the distance between the expanded trajectory of the growing retrosynthesis string and the corpus of extracted strings. By doing so, we promote the assembly of synthetic pathways that, in the chemical reaction space, are more similar to existing retrosyntheses, thereby inheriting the strategic guidelines designed by human experts. We integrated this approach into the RXN platform and present the method's application to complex syntheses as well as its ability to produce better synthetic strategies than current methodologies.
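The string-growing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how trajectories are aligned or scored, so the step-wise sum of Euclidean distances, the prefix alignment, and all names below (`trajectory_distance`, `distance_to_corpus`, `grow_string`) are assumptions made for illustration. Fingerprints are toy 2-D tuples rather than learned reaction embeddings.

```python
import math

def trajectory_distance(path, reference):
    """Sum of step-wise Euclidean distances between a growing string and
    the equally long prefix of a reference string (assumed alignment)."""
    return sum(math.dist(p, r) for p, r in zip(path, reference))

def distance_to_corpus(path, corpus):
    """Distance from the growing string to its closest corpus string.
    Only references at least as long as the path are considered."""
    return min(
        (trajectory_distance(path, ref) for ref in corpus if len(ref) >= len(path)),
        default=math.inf,
    )

def grow_string(path, candidates, corpus):
    """Greedy expansion: pick the candidate step whose extended trajectory
    lies closest to the corpus. Returns (best_candidate, score)."""
    scored = [(distance_to_corpus(path + [c], corpus), c) for c in candidates]
    score, best = min(scored, key=lambda item: item[0])
    return best, score

# Toy corpus of two "patent" strings, each a sequence of step fingerprints.
corpus = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 5.0), (0.0, 6.0)],
]
path = [(0.1, 0.0)]                      # one step already chosen
candidates = [(1.0, 0.2), (0.0, 6.0)]    # hypothetical single-step predictions

best, score = grow_string(path, candidates, corpus)
print(best, round(score, 3))
```

In this toy case the first candidate is selected because extending the path with it keeps the trajectory close to the first corpus string. A practical search would likely keep several partial strings (for example, via beam search) rather than expanding greedily.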