Ilias Iliadis
International Journal On Advances In Networks And Services
Cloud providers increasingly offer bare-metal instances and GPU nodes with full PCI(e) device passthrough to support demanding AI and HPC workloads. While this approach delivers native performance, it also exposes low-level device configuration interfaces directly to tenant software with minimal oversight. We reveal a critical cross-layer reliability issue: tenant-accessible PCI(e) configuration-space operations can trigger system-wide failures that escape device boundaries and manifest as platform-level Reliability, Availability, and Serviceability (RAS) errors. These failures arise from interactions between undocumented device registers, fragile hardware state transitions, and inconsistent platform error handling, creating a gap between assumed device isolation and actual failure propagation. Using a systematic configuration-space exploration approach, we reproduced these failures across multiple device classes (GPUs, NVMe drives, network adapters) in production-grade server platforms. We demonstrate that hardware-enforced isolation alone is insufficient for RAS guarantees when configuration interfaces remain unvalidated. As a practical countermeasure, we show how hypervisor-level filtering can block unsafe operations while maintaining device functionality.
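The hypervisor-level filtering named as a countermeasure above can be pictured as an allowlist check on each tenant write to configuration space. The block below is a minimal sketch under stated assumptions, not the authors' implementation: the names `ConfigWrite`, `ALLOWED_WRITE_RANGES`, `is_write_allowed`, and `filter_write` are hypothetical, and the choice of which registers to allow is illustrative only. The offsets are those of the standard PCI type-0 configuration header.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical policy (an assumption, not taken from the source): a tenant may
# write only these standard type-0 header registers. Each entry is
# (start offset, length in bytes). Everything else, including vendor-specific
# or undocumented registers, is denied by default.
ALLOWED_WRITE_RANGES: Sequence[Tuple[int, int]] = (
    (0x04, 2),   # Command register
    (0x0C, 1),   # Cache Line Size
    (0x10, 24),  # Base Address Registers BAR0..BAR5
    (0x3C, 1),   # Interrupt Line
)


@dataclass(frozen=True)
class ConfigWrite:
    """One tenant-issued configuration-space write (illustrative type)."""
    offset: int
    size: int   # access width in bytes: 1, 2, or 4
    value: int


def is_write_allowed(write: ConfigWrite,
                     allowed: Sequence[Tuple[int, int]] = ALLOWED_WRITE_RANGES) -> bool:
    """Return True only if the whole access lies inside one allowed range."""
    if write.size not in (1, 2, 4) or write.offset < 0:
        return False
    end = write.offset + write.size
    return any(start <= write.offset and end <= start + length
               for start, length in allowed)


def filter_write(write: ConfigWrite,
                 apply_write: Callable[[ConfigWrite], None]) -> bool:
    """Forward an allowed write to the device; silently drop anything else."""
    if is_write_allowed(write):
        apply_write(write)
        return True
    return False


if __name__ == "__main__":
    forwarded = []
    filter_write(ConfigWrite(0x04, 2, 0x0006), forwarded.append)      # Command register
    filter_write(ConfigWrite(0x40, 4, 0xDEADBEEF), forwarded.append)  # outside the allowlist
    print([hex(w.offset) for w in forwarded])
```

A default-deny allowlist is used here because the abstract attributes the failures partly to undocumented registers, which a denylist could not enumerate. A real filter would also need to cover capability structures and extended configuration space, which this sketch leaves out.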
Nikoleta Iliakopoulou, Jovan Stojkovic, et al.
MICRO 2025
Alessandro Pomponio
Kubecon + CloudNativeCon NA 2025
Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024