Publication
CVPR 2024
Workshop paper

QAttn: Efficient GPU Kernels for Mixed-Precision Vision Transformers

Abstract

Vision Transformers have demonstrated outstanding performance in computer vision tasks. Nevertheless, for large models this superior performance comes at the expense of increased memory usage for storing the parameters and intermediate activations. To accelerate model inference, in this work we develop and evaluate integer and mixed-precision kernels in Triton for the efficient execution of two fundamental building blocks of transformers, the linear layer and attention, on graphics processing units (GPUs). On an NVIDIA A100 GPU, our implementations of Vision Transformers achieve a throughput speedup of up to seven times compared with similar kernels in PyTorch FP32 (single precision), while the top-1 accuracy of the ViT-Large model drops by less than one percent on the ImageNet-1K classification task. Furthermore, our kernels achieve speed comparable to the TensorRT INT8 linear layer, and we improve the performance of the baseline FP16 (half precision) Triton attention by up to 20%. We have open-sourced the QAttn (Quantized Attention, pronounced like "katana") framework, which is tightly integrated with the PyTorch quantization workflow.
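
To illustrate the kind of kernel the abstract refers to, the sketch below shows a minimal INT8 linear-layer (matrix multiplication) kernel written in Triton that accumulates in INT32 and dequantizes the result to FP16 with a per-tensor scale. It is an illustrative sketch of the general technique, not the released QAttn kernels: the kernel name, tile sizes, the `scale` parameter, and the `int8_linear` wrapper are assumptions introduced here for exposition.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def int8_matmul_kernel(
    a_ptr, b_ptr, c_ptr,              # A: (M, K) int8, B: (K, N) int8, C: (M, N) fp16
    M, N, K,
    scale,                            # per-tensor dequantization scale (illustrative)
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    # INT8 x INT8 products are accumulated in an INT32 tile.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.int32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs,
                    mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0)
        b = tl.load(b_ptrs,
                    mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    # Dequantize the INT32 accumulator back to FP16 with the per-tensor scale.
    c = (acc.to(tl.float32) * scale).to(tl.float16)
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, c, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))


def int8_linear(a: torch.Tensor, b: torch.Tensor, scale: float) -> torch.Tensor:
    """Hypothetical host-side wrapper: a (M, K) int8, b (K, N) int8 -> (M, N) fp16."""
    M, K = a.shape
    K2, N = b.shape
    assert K == K2
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    int8_matmul_kernel[grid](
        a, b, c, M, N, K, scale,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c
```

In this sketch, the quantized weights and activations stay in INT8 in global memory, the tensor-core dot product accumulates in INT32, and only the final tile is rescaled to FP16, which is the general pattern that avoids materializing full-precision intermediates; the actual tiling, scaling granularity, and fusion choices in QAttn may differ.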