Chemical representation learning for toxicity prediction
Undesired toxicity is a major hindrance in drug discovery and largely responsible for high attrition rates in the early stages. This calls for new, reliable, and interpretable molecular property prediction models that help prioritize compounds and thus reduce both development costs and the risk to humans, animals, and the environment. Here, we propose ToxSmi, an interpretable chemical language model that combines self-attention with multiscale convolutions and relies on data augmentation. We first benchmark various molecular representations (e.g., fingerprints, different flavors of SMILES and SELFIES, as well as graph and graph kernel methods), revealing that SMILES coupled with augmentation yields the best overall performance. Despite its simplicity, ToxSmi is then shown to outperform existing approaches across a wide range of molecular property prediction tasks, including but not limited to toxicity. Moreover, the attention weights of ToxSmi allow for easy interpretation and show enrichment of known toxicophores even without explicit supervision. To introduce a notion of model reliability, we propose and combine two simple methods for uncertainty estimation (Monte Carlo dropout and test-time augmentation). These methods not only identify samples with high prediction uncertainty but also allow forming implicit model ensembles that improve accuracy. Finally, we validate ToxSmi on a large-scale proprietary toxicity dataset and find that it outperforms previous work while yielding similar insights into cytotoxic substructures.
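The combination of Monte Carlo dropout and test-time augmentation mentioned above can be illustrated with a minimal sketch. It is not the ToxSmi implementation: the function `predict_with_uncertainty`, its parameters, and the toy stand-ins `toy_augment` and `toy_predict` are hypothetical names introduced here for illustration. The sketch assumes a predictor whose output is stochastic (as it would be with dropout kept active at inference) and an augmentation function that returns an alternative SMILES string for the same molecule. The spread of the pooled predictions serves as an uncertainty estimate, and their mean acts as an implicit ensemble prediction.

```python
import random
import statistics
from typing import Callable, List, Tuple


def predict_with_uncertainty(
    smiles: str,
    stochastic_predict: Callable[[str], float],
    augment: Callable[[str], str],
    n_augmentations: int = 5,
    n_dropout_passes: int = 10,
) -> Tuple[float, float]:
    """Pool predictions over augmented inputs and stochastic forward passes.

    Returns the mean (implicit ensemble prediction) and the population
    standard deviation (uncertainty estimate) of all pooled predictions.
    """
    predictions: List[float] = []
    for _ in range(n_augmentations):
        # Test-time augmentation: a different string for the same molecule.
        variant = augment(smiles)
        for _ in range(n_dropout_passes):
            # Monte Carlo dropout: each call is assumed to be one
            # forward pass with dropout still active.
            predictions.append(stochastic_predict(variant))
    return statistics.fmean(predictions), statistics.pstdev(predictions)


# Toy stand-ins (assumptions, for illustration only).
_rng = random.Random(0)

# Three equivalent SMILES strings for ethanol.
_ETHANOL_VARIANTS = ["CCO", "OCC", "C(C)O"]


def toy_augment(smiles: str) -> str:
    """Pick one of the hard-coded ethanol variants; ignores the input."""
    return _rng.choice(_ETHANOL_VARIANTS)


def toy_predict(smiles: str) -> float:
    """Return a noisy pseudo-probability that depends weakly on the string."""
    score = 0.3 + 0.05 * smiles.count("(") + _rng.gauss(0.0, 0.02)
    return min(1.0, max(0.0, score))


mean_score, uncertainty = predict_with_uncertainty("CCO", toy_predict, toy_augment)
```

In practice, `toy_augment` would be replaced by a real SMILES enumeration routine and `toy_predict` by a trained model evaluated with dropout enabled; samples with a large `uncertainty` could then be flagged for closer inspection.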