ACS Spring 2023

Enzyme optimization via a generative language modeling-based evolutionary algorithm


Enzymes are molecular engines that nature has designed to enable otherwise impossible chemical reactions. Their exceptional properties make them appealing for more sustainable chemistry: mild reaction conditions, fewer toxic solvents, and less waste. Billions of years of evolution have made enzymes extraordinarily efficient. However, widespread application in industrial processes requires faster design using in-silico approaches, a demanding endeavor that is far from complete. Most approaches introduce mutations into an existing amino acid (AA) sequence, relying on various assumptions and methodologies. Machine learning and deep generative networks have recently gained attention in the protein engineering community, especially extensions that exploit preexisting information on protein binders, physico-chemical characteristics, or 3D structures. We treat enzyme optimization as an evolutionary process in which mutations are modeled by a generalized autoregressive language model trained on fragments of AA sequences from UniProtKB. We use transfer learning to drive the optimization and train a Random Forest as the scoring model on a dataset of biocatalyzed chemical reactions, using pre-trained molecular representations. This allows us to alter active sites to catalyze novel reactions while making minimal assumptions. Our approach enables the design of enzymes with improved predicted biocatalytic activity, mimicking the natural evolutionary process by selecting optimal sequences that reflect the underlying proteomic language.
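The mutate-score-select loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `propose_mutations` stands in for sampling from the autoregressive language model, and `score` stands in for the Random Forest surrogate trained on biocatalysis data; both are hypothetical toy functions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutations(sequence, n_children, rng):
    """Stand-in for sampling point mutations from a language model:
    here each child gets one random position mutated uniformly."""
    children = []
    for _ in range(n_children):
        pos = rng.randrange(len(sequence))
        aa = rng.choice(AMINO_ACIDS)
        children.append(sequence[:pos] + aa + sequence[pos + 1:])
    return children

def score(sequence):
    """Stand-in for the Random Forest scoring model: a toy objective
    (alanine fraction) used purely for illustration."""
    return sequence.count("A") / len(sequence)

def evolve(seed_sequence, generations=10, n_children=20, seed=0):
    """Greedy evolutionary loop: propose candidates, score them,
    and keep the best sequence found in each generation."""
    rng = random.Random(seed)
    best = seed_sequence
    for _ in range(generations):
        candidates = propose_mutations(best, n_children, rng) + [best]
        best = max(candidates, key=score)
    return best

if __name__ == "__main__":
    start = "MKTLLVAGSG"
    optimized = evolve(start)
    print(score(start), score(optimized))
```

In the actual method, the proposal distribution would come from the pre-trained language model and the score from the learned surrogate, but the selection logic follows this same pattern.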