Talk

From Device Passthrough to Host Passout

Abstract

Cloud providers increasingly offer bare-metal instances and GPU nodes with full PCI(e) device passthrough to support demanding AI and HPC workloads. While this approach delivers native performance, it also exposes low-level device configuration interfaces directly to tenant software with minimal oversight. We reveal a critical cross-layer reliability issue: tenant-accessible PCI(e) configuration-space operations can trigger system-wide failures that escape device boundaries and manifest as platform-level Reliability, Availability, and Serviceability (RAS) errors. These failures arise from interactions between undocumented device registers, fragile hardware state transitions, and inconsistent platform error handling, creating a gap between assumed device isolation and actual failure propagation. Using a systematic configuration-space exploration approach, we reproduce these failures across multiple device classes (GPUs, NVMe drives, network adapters) on production-grade server platforms. We demonstrate that hardware-enforced isolation alone is insufficient for RAS guarantees when configuration interfaces remain unvalidated. As a practical countermeasure, we show how hypervisor-level filtering can block unsafe operations while maintaining device functionality.
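
To make the setting concrete, the following minimal sketch (not the talk's actual tooling; the device address is an example) shows how software with access to a passed-through device can read its PCI(e) configuration space through Linux sysfs. A systematic exploration of the kind described above would additionally walk the capability list and exercise writes to each writable offset.

    #!/usr/bin/env python3
    # Minimal sketch: dump the PCI(e) configuration space of a device via sysfs.
    # With write access to the same file, a tenant can also modify these registers.
    import sys

    BDF = sys.argv[1] if len(sys.argv) > 1 else "0000:01:00.0"  # example address
    CFG = f"/sys/bus/pci/devices/{BDF}/config"

    with open(CFG, "rb") as f:
        space = f.read()  # limited bytes unprivileged, up to 4 KiB as root

    vendor = int.from_bytes(space[0:2], "little")
    device = int.from_bytes(space[2:4], "little")
    print(f"{BDF}: vendor={vendor:04x} device={device:04x}, {len(space)} bytes readable")

    # Hex dump in 16-byte rows.
    for off in range(0, len(space), 16):
        row = " ".join(f"{b:02x}" for b in space[off:off + 16])
        print(f"{off:03x}: {row}")

The hypervisor-level countermeasure can be illustrated in the same spirit. The sketch below is conceptual only (function and table names are hypothetical, not the VFIO/QEMU code paths): guest configuration-space writes are forwarded to hardware only when they fall entirely inside an allowlist of standard, well-understood registers, while writes to undocumented or unvalidated offsets are dropped or emulated.

    # Conceptual filter for guest-initiated config-space writes (hypothetical names).

    # Standard header fields a guest plausibly needs to touch: (offset, length).
    ALLOWED_WRITES = [
        (0x04, 2),   # Command register
        (0x0C, 1),   # Cache Line Size
        (0x10, 24),  # BAR0-BAR5 (typically emulated rather than passed through)
        (0x3C, 1),   # Interrupt Line
    ]

    def write_allowed(offset: int, length: int) -> bool:
        """True only if the write falls entirely inside an allowed field."""
        return any(start <= offset and offset + length <= start + size
                   for start, size in ALLOWED_WRITES)

    def filter_config_write(offset: int, data: bytes) -> bool:
        """Hypothetical trap handler: forward safe writes, drop the rest."""
        if write_allowed(offset, len(data)):
            return True   # forward to the physical device
        return False      # undocumented/unvalidated register: drop or emulate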