FHENDI: A Near-DRAM Accelerator for Compiler-Generated Fully Homomorphic Encryption Applications
Abstract
Fully homomorphic encryption (FHE) is a powerful cryptographic technique that enables computation on encrypted data without decrypting it. It has broad applications in scenarios where sensitive data must be processed in the cloud or in other untrusted environments. FHE applications are both compute- and memory-intensive, owing to expensive operations on large data. While prior works address the challenge of efficient computation using dedicated hardware, expensive memory transfers remain a major limiting factor. In this work, we propose FHENDI, a hierarchical near-DRAM processing (NDP) solution for FHE applications that harnesses the massive DRAM bank bandwidth. We observe various data access patterns in FHE that reveal four distinct levels of parallelism: element-wise, limb-wise, coefficient-wise, and ciphertext-wise. FHENDI exploits these levels of parallelism to map FHE operations and data onto different levels of its design hierarchy, while addressing three major challenges of NDP for FHE: (i) the lack of bank-to-bank communication support, (ii) limited die-to-die bandwidth, and (iii) large memory access latencies. We resolve the first challenge through a novel, conflict-free mapping algorithm built atop localized permutation networks that enables efficient element-wise and butterfly operations in FHE. We address the second by pipelining the execution of the parallel bootstrap operations observed in compiled FHE workloads. Finally, we address the third by hiding memory access latency behind computation latency, using a dual-banking scheme and the subarray-level parallelism (SLP) of the DRAM banks. We evaluate FHENDI on representative workloads, obtained using the HElayers compiler framework, in the domains of privacy-preserving machine learning inference on CNNs and Transformers, database range queries, and sorting. We compare FHENDI with a server-class CPU and GPU running the state-of-the-art HEaaN library, and with an FHE accelerator ASIC, and report mean speedups of 2145.8x, 118.29x, and 2.45x, respectively.
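The latency-hiding idea behind the dual-banking scheme can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not FHENDI's actual scheduler or evaluation methodology: it assumes the work splits into equal tiles, that each tile has a fixed fetch time `t_mem` and compute time `t_comp`, and that two buffers let the fetch of the next tile overlap with the computation on the current one. The function names and parameters are hypothetical.

```python
def serial_time(n_tiles, t_mem, t_comp):
    """Total time when each tile is fetched and then computed, with no overlap."""
    return n_tiles * (t_mem + t_comp)


def overlapped_time(n_tiles, t_mem, t_comp):
    """Total time with two buffers (assumed model, not the paper's):
    tile i+1 is fetched while tile i is being computed."""
    if n_tiles == 0:
        return 0
    # The first fetch and the last compute cannot be overlapped with anything.
    # Each of the n_tiles - 1 steps in between costs as much as the slower
    # of the two overlapped stages.
    return t_mem + (n_tiles - 1) * max(t_mem, t_comp) + t_comp


if __name__ == "__main__":
    # Illustrative numbers only; units are arbitrary.
    n, t_mem, t_comp = 4, 3, 5
    print("serial:    ", serial_time(n, t_mem, t_comp))
    print("overlapped:", overlapped_time(n, t_mem, t_comp))
```

In this model, when `t_comp >= t_mem`, all memory access time except the first fetch is hidden behind computation. When memory is the slower stage, overlapping helps but cannot hide it entirely.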