NeurIPS 2021
Workshop paper

Leveraging Adversarial Reprogramming for Novel Structure-constrained Protein Sequence Design


Designing novel and diverse protein sequences consistent with a given structure is an important task for scientific discovery. Recently, deep language models that learn from large unlabeled corpora have shown impressive success in protein sequence generation. However, only a small fraction of the sequence corpus has structural annotation available, so training a model from scratch to generate structure-constrained sequences can lead to degraded performance. Adversarial Reprogramming (AR) repurposes pre-trained machine learning models for target-domain tasks with scarce data, where it may be difficult to train a high-performing model from scratch. Prior work in AR has focused primarily on classification tasks. In this work, we seek to extend reprogramming beyond classification to the more complex problem of sequence generation in the molecular space. Specifically, we repurpose pre-trained language models used for text infilling to infill protein sequence templates as a method of novel protein generation. In doing so, we demonstrate that, via AR, sequence generation for low-resource data is achievable while still upholding the structural integrity of the sequences.
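
To make the idea of infilling a protein sequence template concrete, the following is a minimal sketch and not the method of the paper. It assumes a template is built by masking a random fraction of residues, which the abstract does not specify. The names `MASK`, `make_template`, `infill`, and `propose` are hypothetical, and `propose` is a stand-in for whatever reprogrammed language model would fill the masked slots.

```python
import random

MASK = "<mask>"  # hypothetical mask token; a real model defines its own


def make_template(sequence, mask_fraction=0.3, seed=0):
    """Return the sequence as a token list with a random subset masked.

    Random masking is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    masked = set(rng.sample(range(len(sequence)), n_mask))
    return [MASK if i in masked else aa for i, aa in enumerate(sequence)]


def infill(template, propose):
    """Fill every masked slot with propose(position, template).

    `propose` is a placeholder for an infilling model; unmasked
    residues are copied through unchanged.
    """
    return "".join(
        propose(i, template) if tok == MASK else tok
        for i, tok in enumerate(template)
    )


if __name__ == "__main__":
    template = make_template("MKTAYIAKQR", mask_fraction=0.3)
    # Toy proposer that always suggests alanine; a real system would
    # query the reprogrammed language model here instead.
    print(template)
    print(infill(template, lambda i, t: "A"))
```

In this sketch the unmasked residues act as the fixed context, and only the masked positions are regenerated, which is the sense in which infilling yields new sequences from an existing template.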