Publication
NeurIPS 2024
Workshop paper

Recurrent Transformers Trade-off Parallelism for Length Generalization on Regular Languages

Abstract

Transformers have achieved remarkable success in Natural Language Processing but struggle with state tracking and algorithmic reasoning tasks, such as modeling Regular Languages. In contrast, Recurrent Neural Networks (RNNs) exhibit perfect generalization when modeling Regular Languages. To bridge this gap, we explore Recurrent Transformer variants that incorporate chunking, balancing the parallelizability of Transformers with the sequential processing of RNNs. We identify layer-recurrence as the key type of recurrence that allows Recurrent Transformers to succeed in modeling Regular Languages. Further analysis indicates a rapid decline in generalization performance as chunk size increases beyond two, albeit with an exponential decrease in training time. This study underscores the critical role of layer-recurrence and chunk size in Recurrent Transformers, highlighting the trade-off between generalization capabilities and parallelism.
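
The sketch below illustrates the general chunked-recurrence idea described in the abstract: a sequence is split into fixed-size chunks, tokens within a chunk are processed in parallel by a Transformer encoder, and a carried state summarizes previous chunks. It is a minimal, assumed implementation for intuition only; the class name, the state-token mechanism, and all hyperparameters are illustrative and do not reproduce the paper's specific layer-recurrence variant.

```python
# Minimal sketch (not the paper's implementation) of a chunked Recurrent
# Transformer: chunks are processed sequentially, tokens within a chunk in
# parallel, so smaller chunks trade parallelism for more recurrence.
import torch
import torch.nn as nn


class ChunkedRecurrentTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=64, n_heads=4, n_layers=2, chunk_size=2):
        super().__init__()
        self.chunk_size = chunk_size
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Learned initial state token carried across chunks (an assumption of
        # this sketch, not necessarily the paper's recurrence mechanism).
        self.init_state = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, tokens):  # tokens: (batch, seq_len)
        x = self.embed(tokens)
        state = self.init_state.expand(x.size(0), -1, -1)
        outputs = []
        for start in range(0, x.size(1), self.chunk_size):
            chunk = x[:, start:start + self.chunk_size]
            # Prepend the recurrent state so the chunk can attend to it.
            h = self.encoder(torch.cat([state, chunk], dim=1))
            state = h[:, :1]           # updated state carried to the next chunk
            outputs.append(h[:, 1:])   # per-token representations for this chunk
        return torch.cat(outputs, dim=1)
```

With `chunk_size=1` the model behaves like an RNN (fully sequential); larger chunks recover more of the Transformer's parallelism at the cost of fewer recurrent steps, which is the trade-off the abstract highlights.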