ISCA 2024
Conference paper

HAL: Hardware-assisted Load Balancing for Energy-efficient SNIC-Host Cooperative Computing


A typical SmartNIC (SNIC) integrates a processor, consisting of an Arm CPU and accelerators, with a conventional NIC. The processor is designed to energy-efficiently execute functions frequently used by network-intensive datacenter applications. With such a processor, the SNIC promises to notably increase the overall energy efficiency of datacenter servers. Nevertheless, the recent trend of integrating accelerators for these functions into server CPUs raises the question of whether the SNIC processor is superior to a host processor (i.e., a server CPU with accelerators) in system-wide energy efficiency, especially under given tail latency constraints. To answer this pressing question, we first take an Intel Xeon processor integrated with various accelerators (i.e., QuickAssist Technology) as a host processor, and then compare it to an NVIDIA BlueField-2 SNIC processor. This comparison reveals that (1) the host accelerators, coupled with a more powerful memory subsystem, can outperform the SNIC accelerators and (2) the SNIC processor can improve system-wide energy efficiency over the host processor only at low packet rates for most functions under tail latency constraints. To offer high system-wide energy efficiency without hurting tail latency at any packet rate, we propose HAL, consisting of a hardware-based load balancer and an intelligent load balancing policy implemented inside the SNIC. When HAL detects that the SNIC processor cannot efficiently process a given function beyond a specific packet rate, it limits the rate of packets to the SNIC processor and lets the host processor handle the excess. HAL works for stateless functions with conventional PCIe-attached SNICs for now, but we also demonstrate that HAL can work for stateful functions as effectively with a CXL-attached SNIC.
We implement HAL with an FPGA connected to the BlueField-2 SNIC and show that HAL enables the SNIC processor to improve the energy efficiency and throughput of the server by 31% and 10%, respectively, without notably hurting tail latency.
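
The rate-limiting idea described above (cap the packet rate sent to the SNIC processor and let the host processor handle the excess) can be illustrated with a minimal software sketch. This is not the HAL hardware design or its policy; the names `RateSplit`, `split_packet_rate`, and `snic_cap_pps`, as well as the numeric rates in the usage lines, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RateSplit:
    """Packet rates (packets per second) assigned to each processor."""
    snic_pps: float
    host_pps: float


def split_packet_rate(offered_pps: float, snic_cap_pps: float) -> RateSplit:
    """Cap the SNIC's share at snic_cap_pps and send the excess to the host.

    offered_pps:  incoming packet rate for one function (assumed input).
    snic_cap_pps: assumed rate beyond which the SNIC processor is no longer
                  the efficient choice for this function.
    """
    if offered_pps < 0 or snic_cap_pps < 0:
        raise ValueError("packet rates must be non-negative")
    # The SNIC takes everything up to its cap; the host takes the remainder.
    snic_pps = min(offered_pps, snic_cap_pps)
    return RateSplit(snic_pps=snic_pps, host_pps=offered_pps - snic_pps)


# Illustrative values only: below the cap, the SNIC handles all packets.
low = split_packet_rate(offered_pps=500_000, snic_cap_pps=1_000_000)
# Above the cap, the excess is redirected to the host processor.
high = split_packet_rate(offered_pps=3_000_000, snic_cap_pps=1_000_000)
```

In this sketch the cap is a fixed input. The abstract states only that HAL detects when the SNIC processor cannot efficiently process a function beyond a specific packet rate, so how that rate is determined is left outside the example.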