ACS Fall 2023

Molecular dynamics as a data source: scaling simulation for building AI models


Molecular dynamics simulation is well established as a technique contributing to drug and materials discovery. Its use as a data source for training AI models is increasingly important. Scaling the scope and size of such datasets will be key to building foundation models based on large-scale, diverse information. We use an IBM-developed open-source toolkit, the Simulation Toolkit for Scientific Discovery (ST4SD), to automate simulation workflows. These workflows can be readily scaled to take full advantage of traditional high-performance computing and emerging OpenShift clusters. We then show how large-scale simulation data can be digested by graph-based deep neural networks that our team has designed. We build a model for antigen-peptide immunogenicity prediction that outperforms models built on hand-engineered features and trained on the same dataset, and that is further shown to outperform state-of-the-art sequence-based models in the low-data regime.