Publication
Digital Discovery
Paper
Chemical representation learning for toxicity prediction
Abstract
Undesired toxicity is a major hindrance in drug discovery and is largely responsible for high attrition rates in the early stages. This calls for new, reliable, and interpretable molecular property prediction models that help to prioritize compounds and thus reduce the high costs of development and the risk to humans, animals, and the environment. Here, we propose ToxSmi, an interpretable chemical language model that combines self-attention with multiscale convolutions and relies on data augmentation. We first benchmark various molecular representations (e.g., fingerprints, different flavors of SMILES and SELFIES, as well as graph and graph kernel methods), revealing that SMILES coupled with augmentation yields the best overall performance. Despite its simplicity, ToxSmi is then shown to outperform existing approaches across a wide range of molecular property prediction tasks, including but not limited to toxicity. Moreover, the attention weights of ToxSmi allow for easy interpretation and show enrichment of known toxicophores even without explicit supervision. To introduce a notion of model reliability, we propose and combine two simple methods for uncertainty estimation (Monte Carlo dropout and test-time augmentation). These methods not only identify samples with high prediction uncertainty, but also allow forming implicit model ensembles that improve accuracy. Finally, we validate ToxSmi on a large-scale proprietary toxicity dataset and find that it outperforms previous work while yielding similar insights into cytotoxic substructures.
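The abstract does not spell out how the two uncertainty estimates are combined, so the following is only a minimal sketch of the general idea, not the ToxSmi implementation. It assumes a hypothetical toy scorer (`toy_forward` with made-up `TOY_WEIGHTS`) in place of a trained model, and a hand-written list of equivalent SMILES strings for ethanol in place of an augmentation routine. Dropout stays active at prediction time (Monte Carlo dropout), every SMILES variant is scored several times (test-time augmentation), and the pooled predictions give both an ensemble mean and a spread that can serve as an uncertainty signal.

```python
import math
import random
import statistics

# Hypothetical per-character weights standing in for a trained model.
# These values are invented for illustration and are NOT from the paper.
TOY_WEIGHTS = {"C": 0.4, "O": -0.7, "(": 0.1, ")": 0.1}


def toy_forward(smiles, rng, dropout_p=0.2):
    """One stochastic forward pass with dropout kept active at inference."""
    total = 0.0
    for pos, ch in enumerate(smiles):
        # Inverted dropout: drop a feature with probability dropout_p
        # and rescale the surviving features by 1 / (1 - dropout_p).
        if rng.random() < dropout_p:
            continue
        feature = TOY_WEIGHTS.get(ch, 0.0) / (1 + pos)  # position-dependent
        total += feature / (1.0 - dropout_p)
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid gives a pseudo-probability


def predict_with_uncertainty(smiles_variants, n_passes=50, seed=0):
    """Pool MC-dropout passes over all SMILES variants of one molecule."""
    rng = random.Random(seed)
    preds = [
        toy_forward(s, rng)
        for s in smiles_variants      # test-time augmentation
        for _ in range(n_passes)      # Monte Carlo dropout
    ]
    # Mean acts as an implicit ensemble prediction; the standard
    # deviation is a simple proxy for prediction uncertainty.
    return statistics.mean(preds), statistics.pstdev(preds)


# Four hand-written SMILES strings that all denote ethanol.
variants = ["CCO", "OCC", "C(O)C", "C(C)O"]
mean, std = predict_with_uncertainty(variants)
print(f"ensemble prediction: {mean:.3f}, uncertainty (std): {std:.3f}")
```

In practice the variants would come from a cheminformatics toolkit that enumerates randomized SMILES, and the forward pass would be a neural network with its dropout layers left enabled. Samples with a large spread could then be flagged for closer review.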