Topology-Driven Completion of Chemical Data
Abstract
Discovery of functional materials requires efficient exploration of chemical space. We introduce an approach that identifies gaps (lacunae) in chemical data and fills them in a targeted manner. First, we apply topological data analysis (TDA) [1] to a set of molecules, producing an approximation to the Reeb graph in which loops and branches indicate missing data. Second, we generate novel molecules that complete the loops and branches of the TDA graph, using a modified graph generative model for scaffold-based molecular design [3]. Generation is conditioned on the existing scaffolds, which ensures that every generated molecule contains its input scaffold. The loss function is modified to account for the generative potential of the scaffolds, gsn: we reduce the influence of scaffolds with low gsn and penalize the generation of molecules with low gsn. We discuss the application of this approach to the exploration of photo-acid generators.

[1] Singh, G. et al. "Topological methods for the analysis of high dimensional data sets and 3D object recognition." SPBG (2007).
[2] Dijkstra, E. W. "A note on two problems in connexion with graphs." Numerische Mathematik 1.1 (1959): 269-271.
[3] Lim, J. et al. "Scaffold-based molecular design with a graph generative model." Chemical Science (2020).
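
The TDA step in the abstract corresponds to a Mapper-style construction in the sense of [1]. The following is a minimal sketch of that idea, not the authors' implementation: 2D points on a ring stand in for molecular descriptors, and the filter (the x-coordinate), the interval cover, the distance threshold `eps`, and the helper names `connected_components`, `mapper_graph`, and `cycle_rank` are all assumptions made for illustration.

```python
import math
from itertools import combinations


def connected_components(indices, points, eps):
    """Single-linkage clustering: flood-fill points lying within eps of each other."""
    unvisited = set(indices)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited if math.dist(points[i], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, filter_values, n_intervals=6, overlap=0.3, eps=0.3):
    """Build a Mapper graph: nodes are clusters, edges join clusters sharing points."""
    lo, hi = min(filter_values), max(filter_values)
    length = (hi - lo) / n_intervals
    pad = overlap * length  # how far each interval extends into its neighbours
    nodes = []
    for k in range(n_intervals):
        a = lo + k * length - pad
        b = lo + (k + 1) * length + pad
        members = [i for i, f in enumerate(filter_values) if a <= f <= b]
        nodes.extend(connected_components(members, points, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


def cycle_rank(n_nodes, edges):
    """Number of independent loops: edges - nodes + connected components."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    components = len({find(x) for x in range(n_nodes)})
    return len(edges) - n_nodes + components


# Toy data (hypothetical): 60 points on a unit circle, i.e. a set with a hole.
ring = [(math.cos(2 * math.pi * i / 60), math.sin(2 * math.pi * i / 60))
        for i in range(60)]
nodes, edges = mapper_graph(ring, [p[0] for p in ring])
loops = cycle_rank(len(nodes), edges)
print(f"{len(nodes)} nodes, {len(edges)} edges, {loops} independent loop(s)")
```

On this toy ring the graph closes into a single loop, which is the kind of feature the abstract reads as a region of missing data. In a real setting the points would be molecular descriptors, and the filter and clustering choices would need to be tuned to the data set.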