Mengmei Ye, Angelo Ruocco
KVM Forum 2022
As models scale into the trillions of parameters, extending their functionality is increasingly achieved by fine-tuning existing base models rather than training new ones from scratch. However, fine-tuning all parameters remains computationally expensive. Recent techniques such as Low-Rank Adaptation (LoRA) reduce the number of trainable parameters. LoRA adapters have gained widespread adoption, but their effects on GPU system metrics, such as throughput and energy efficiency, are not yet well understood. In this study, we examine these system-level metrics as a function of LoRA adapter rank. Our findings show that reducing the rank of a LoRA adapter does not significantly degrade model quality, while improving throughput, energy efficiency, and memory usage. Further, we find that it is the presence of a LoRA adapter, rather than its rank, that greatly improves model quality over zero-shot inference with the base model. This makes smaller LoRA adapters a compelling choice for a variety of applications.
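To make the rank knob concrete, the following is a minimal sketch of how an adapter rank is set in practice, assuming the Hugging Face PEFT library; the base model name and hyperparameter values here are illustrative, not those used in the study.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model; the study's models and settings may differ.
    base = AutoModelForCausalLM.from_pretrained("gpt2")

    # LoRA learns a low-rank update B @ A in place of a full weight update,
    # so the number of trainable parameters grows with the rank r.
    config = LoraConfig(
        r=8,                        # adapter rank, the variable examined above
        lora_alpha=16,              # scaling factor applied to the low-rank update
        target_modules=["c_attn"],  # attention projection layers in GPT-2
        lora_dropout=0.05,
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # lowering r shrinks this count, and with it
                                        # the adapter's memory footprint

Lowering r directly reduces the size of the A and B matrices, which is the mechanism behind the memory and energy savings the abstract reports.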
Elaine Palmer
OCP Global Summit 2020
Marcelo Amaral, Huamin Chen, et al.
CLOUD 2023
Dionysios Diamantopoulos, Burkhard Ringlein, et al.
CLOUD 2023