CLOUD 2024
Conference paper

Securing AI Inference in the Cloud: Is CPU-GPU Confidential Computing Ready?


Many applications have been offloaded to cloud environments to achieve higher agility, access more powerful computational resources, and obtain better infrastructure management. Although cloud environments provide solid security solutions, users with highly sensitive data or regulatory compliance requirements, such as HIPAA (Health Insurance Portability and Accountability Act) and GDPR (General Data Protection Regulation), still hesitate to move such application domains to the cloud. To address these concerns, cloud service providers have started to offer solutions that protect data confidentiality and integrity through trusted execution environments (TEEs). While such offerings were previously limited to CPU TEEs, NVIDIA’s Hopper architecture has shifted the landscape by enabling the confidential computing features essential for protecting the confidentiality and integrity of real-world applications offloaded to GPUs, such as large language models (LLMs). However, there has been no sufficient study of how much performance overhead confidential computing introduces in a TEE comprising a CPU-GPU configuration. In this paper, we evaluate a confidential computing environment comprising an Intel TDX system and NVIDIA H100 GPUs through various microbenchmarks and real workloads, including the BERT, LLaMA, and Granite large language models, and discuss the overhead incurred by confidential computing when GPUs are utilized. We show that while LLM inference performance is sensitive to model type and batch size, when larger models with pipelined processing are deployed, inference performance in CPU-GPU TEEs can be nearly on par with that of non-confidential setups.