Publication
ACS Spring 2022
Conference paper
Identification of enzymatic active sites with unsupervised language modelling
Abstract
The first decade of genome sequencing saw a surge in the characterisation of proteins with unknown functionality. Even so, more than 20% of proteins in well-studied model animals have yet to be identified, making the discovery of their active sites one of biology's greatest difficulties. Herein, we apply a transformer architecture to a language representation of bio-catalyzed chemical reactions to learn the signal underlying the atomic interactions between substrate and active site. The language representation comprises a reaction simplified molecular-input line-entry system (SMILES) string for substrates and products, complemented with amino acid (AA) sequence information for the enzyme. By defining a custom tokenizer and a score based on attention values, we show that we can capture the substrate-active site interaction signal and use it to detect the location of the active site in unknown protein sequences, hence elucidating complex 3D interactions while relying solely on 1D representations. We consider a Transformer-based model, BERT, trained with different losses, and analyse its performance in comparison with a statistical baseline and methods based on sequence alignments. With no supervision, our approach recovers 31.51% of the active site when co-crystallized substrate-enzyme structures are taken as ground truth, largely outperforming sequence alignment-based approaches. Our findings are further corroborated by docking simulations on the 3D structures of a few enzymes. This work confirms the unprecedented impact of natural language processing, and more specifically of the transformer architecture, on domain-specific languages, paving the way to effective solutions for protein functional characterisation and bio-catalysis engineering.
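To make the idea of a 1D language representation and an attention-based score more concrete, the following is a minimal sketch in Python. It is an illustration only and is not the tokenizer, model, or scoring rule used in the paper: the token pattern `SMILES_TOKEN`, the helper names `tokenize` and `top_residues`, the one-token-per-residue choice, and the rule of summing attention over substrate tokens are all assumptions made for this example. The attention matrix is passed in as plain numbers rather than taken from a trained BERT model.

```python
import re

# Simplified SMILES token pattern (an assumption for illustration,
# not the custom tokenizer described in the abstract).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-./\\:~@*]"
)


def tokenize(reaction_smiles, aa_sequence):
    """Split a 'substrate>>product' reaction SMILES and an enzyme
    amino acid sequence into three token lists."""
    substrate, product = reaction_smiles.split(">>")
    substrate_tokens = SMILES_TOKEN.findall(substrate)
    product_tokens = SMILES_TOKEN.findall(product)
    # Hypothetical choice: one token per residue.
    aa_tokens = list(aa_sequence)
    return substrate_tokens, aa_tokens, product_tokens


def top_residues(attention, k):
    """Rank residues by attention received from substrate tokens.

    attention[i][j] is the weight from substrate token i to residue j.
    Returns the indices of the k highest-scoring residues, in sequence order.
    """
    n_residues = len(attention[0])
    scores = [sum(row[j] for row in attention) for j in range(n_residues)]
    ranked = sorted(range(n_residues), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    sub, aa, prod = tokenize("CC(=O)O>>CCO", "MKV")
    print(sub, aa, prod)
    # Toy attention values: 2 substrate tokens attending to 3 residues.
    toy_attention = [[0.1, 0.7, 0.2], [0.2, 0.5, 0.3]]
    print(top_residues(toy_attention, k=2))
```

In this toy setting, the residues returned by `top_residues` play the role of a predicted active site; in practice the attention values would come from a model trained on reaction and sequence data, and the aggregation across layers and heads is a further design choice not specified here.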