ACS Fall 2023

Topology-driven pre-training for robust molecular property prediction models


Deep learning models have shown great potential for predicting molecular properties. Such models must learn latent representations that capture the intrinsic geometry of molecules and preserve their symmetries. To address this problem, we propose a strategy for pre-training such models on 2D molecular graphs that exploits a topological invariant based on simplicial homology. The invariant is computed as a node-level feature capturing both the local and global structure of the graph: given a graph G and a node v, we consider G-v, the largest subgraph of G not containing v, and compute its Betti numbers beta0(G-v) and beta1(G-v). In effect, we remove a node and count the connected components and independent cycles that remain in the graph. We first pre-train a graph-aware transformer model with this objective to learn the underlying structural features of molecules, and then fine-tune the model on the target molecular property prediction task. Evaluated on several benchmark datasets, our pre-training strategy consistently improves model performance compared to other pre-training methods. These results demonstrate the effectiveness of incorporating topological information into the pre-training of molecular property prediction models and highlight the potential of our approach to advance the field.
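The node-level target described above can be sketched in plain Python. For a simple graph, beta0 is the number of connected components and beta1 is the cyclomatic number |E| - |V| + beta0; the helper names and the adjacency-dict representation below are illustrative choices, not part of the abstract.

```python
from collections import deque

def betti_numbers(adj):
    """Return (beta0, beta1) for a simple undirected graph.

    adj: dict mapping each node to the set of its neighbours.
    beta0 = number of connected components (via BFS);
    beta1 = |E| - |V| + beta0, the number of independent cycles
            (first Betti number of the graph viewed as a 1-complex).
    """
    seen, b0 = set(), 0
    for start in adj:
        if start in seen:
            continue
        b0 += 1                      # new component discovered
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge counted twice
    return b0, m - n + b0

def node_deletion_features(adj):
    """For each node v, the pre-training target (beta0(G-v), beta1(G-v))."""
    feats = {}
    for v in adj:
        sub = {u: adj[u] - {v} for u in adj if u != v}  # vertex-deleted subgraph G - v
        feats[v] = betti_numbers(sub)
    return feats
```

For example, on a six-cycle (the carbon skeleton of benzene) the whole graph has beta0 = 1 and beta1 = 1, while deleting any node leaves a path, so every node's target is (1, 0).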