Publication
NeurIPS 2024
Workshop paper

A Mamba-Based Foundation Model for Chemistry

Abstract

Large-scale pre-trained foundation models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Most available chemical foundation models are based on the Transformer architecture and its core attention module. The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window and quadratic scaling with respect to the window length. Structured state space sequence models (SSMs) have recently emerged as a promising class of architectures for sequence modeling. Mamba is a simplified end-to-end SSM-based neural network architecture without attention or even MLP blocks. This paper introduces Mamba-based chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. These models support different complex tasks, including molecular property prediction, classification, molecular reconstruction, and synthesis yield prediction. Our experiments across multiple benchmark datasets validate the SSM's capacity to provide state-of-the-art results while being designed for fast inference.
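For readers unfamiliar with SSMs, the minimal sketch below illustrates the linear-time recurrence that underlies this class of models and explains the contrast with quadratic self-attention drawn in the abstract. It is an illustrative toy, not the authors' implementation: the matrices `A`, `B`, `C` and all dimensions are assumptions, and Mamba additionally makes these parameters input-dependent (selective), which is omitted here.

```python
# Minimal sketch of a discretized state space recurrence (illustrative only).
# Cost is O(T) in sequence length T, versus O(T^2) for self-attention.
import numpy as np

def ssm_scan(u, A, B, C):
    """Run a linear SSM over a token sequence u of shape (T, d_in).

    A: (d_state, d_state) state transition matrix
    B: (d_state, d_in)    input projection
    C: (d_out, d_state)   output readout
    """
    T = u.shape[0]
    x = np.zeros(A.shape[0])           # hidden state carried across tokens
    ys = np.empty((T, C.shape[0]))
    for t in range(T):
        x = A @ x + B @ u[t]           # state update from the current token
        ys[t] = C @ x                  # per-token output
    return ys

# Toy usage: 16 "token embeddings" of width 8 (all values are hypothetical).
rng = np.random.default_rng(0)
u = rng.normal(size=(16, 8))
A = 0.9 * np.eye(4)                    # simple stable dynamics
B = 0.1 * rng.normal(size=(4, 8))
C = rng.normal(size=(2, 4))
print(ssm_scan(u, A, B, C).shape)      # (16, 2)
```

Because each step depends only on the previous hidden state, such models stream over arbitrarily long token sequences at constant memory per step, which is the property the abstract refers to when citing fast inference.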