Why it's time to "free the turtles."
Since the beginning of cloud computing, virtual machines (VMs) have been a fundamental technology for deploying services. Among other things, VMs allow for easy deployment, better resource utilization, and flexibility. With the rise of container technologies, one might think that VMs are no longer critical for cloud computing, but this is far from true.
Container orchestration solutions like Kubernetes (K8s) and OpenShift (OCP) need to be deployed onto some kind of infrastructure, and in the cloud, the most common form of IaaS (Infrastructure as a Service) is virtual machines. So a classic deployment follows the cloud computing pyramid shown below; K8s/OpenShift is a PaaS (Platform as a Service) offering built on top of a VM-based IaaS offering.
K8s and OpenShift users are becoming increasingly interested in leveraging VMs inside their K8s clusters to deliver stronger workload isolation. This has led to the growing popularity of technologies like KubeVirt and Kata Containers. KubeVirt helps deploy VM-based workloads in K8s clusters, letting users manage and run their VMs on a K8s-based platform alongside traditional containerized workloads. Kata Containers strengthens the container boundary by running each container within a dedicated VM, which provides better security and isolation compared to traditional containers; Kata containers can also be easily deployed on a K8s-based platform. Since Kubernetes nodes are already deployed in IaaS VMs, adding a second layer of VMs, as is done in the KubeVirt and Kata Containers projects, requires the use of nested virtualization. Each K8s/OpenShift worker node is a level-1 (L1) VM, and the containerized workload runs in level-2 (L2), or nested, VMs.
About a decade ago, IBM was at the forefront of open-source nested VM technologies for x86 systems with the Turtles Project, which became the foundation of the current nested virtualization implementation in Linux/KVM. This now-famous paper opens with a quote, referencing a phrase which has become synonymous with infinite recursion:
"The scientist gave a superior smile before replying, 'What is the tortoise standing on?' 'You're very clever, young man, very clever,' said the old lady. 'But it's turtles all the way down!'"
Nested virtualization is not supported by most cloud providers. There are security concerns: implementing nested virtualization enlarges the code base of host hypervisors, which expands the attack surface, as demonstrated by known security bugs such as CVE-2021-3656 and CVE-2021-29657. Nested VMs also have poor I/O performance, which has been discussed extensively in the community, including in "Hybrid² Nested VM IO Performance Tuning" by Lan and Peng. Nested virtualization also has incompatibility issues with emerging technologies for confidential computing such as AMD's Secure Encrypted Virtualization (SEV) and Intel's Trust Domain Extensions (TDX), which enable workloads to run inside a secure, encrypted enclave. These enclaves can only be exposed through a single layer of virtualization.
Today, for a user to take advantage of technologies like KubeVirt and Kata Containers, or of solutions like SEV and TDX, K8s/OpenShift deployments would need to run on bare-metal servers, because there is little support for nested virtualization. Unfortunately, bare-metal servers are expensive: individual workloads rarely require all the resources available on a node, and bare-metal servers fundamentally cannot be shared.
Our research team sees a path past nested virtualization. We foresee an environment where IaaS VMs and client-requested VMs run at the same virtualization level, flattening the hierarchy and removing VM nesting. We say it's time to free the "turtles": let them live side by side, not stacked on top of each other. This approach still keeps some of the benefits of nested VMs, for example that the resources of an L2 VM are carved out of its L1 VM, and that the L1 VM controls the lifecycle of the L2 VMs it created.
In our solution, standard VMs (PriVMs) have the ability to ask the host hypervisor to "spawn" special VMs, called Secondary VMs (SecVMs). A SecVM cannot allocate new resources; it can only use resources of the PriVM that spawned it. This is done with a mix of technologies, including (1) hot-plug, to dynamically move resources like memory, CPUs, and devices between VMs, and (2) cgroups, to guarantee that a PriVM and its SecVMs together never use more resources than were originally allocated to the PriVM. The PriVM and its SecVMs are connected through a virtual private network: all traffic between a SecVM and the outside world has to go through the PriVM. Disks can be shared or moved between the PriVM and its SecVMs.
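As a rough illustration of the accounting this implies (the class and method names below are hypothetical, not part of any published SecVM API), the invariant can be sketched as: a SecVM request succeeds only if it fits within what remains of the PriVM's original allocation.

```python
# Hypothetical sketch of the PriVM/SecVM resource carve-out invariant.
# All names (PriVM, spawn_secvm, ...) are illustrative, not a real API.
from dataclasses import dataclass, field

@dataclass
class Resources:
    vcpus: int
    memory_mb: int

@dataclass
class PriVM:
    allocated: Resources                      # fixed when the PriVM is created
    secvms: list = field(default_factory=list)

    def _in_use(self) -> Resources:
        # Resources currently hot-plugged into SecVMs
        return Resources(
            vcpus=sum(s.vcpus for s in self.secvms),
            memory_mb=sum(s.memory_mb for s in self.secvms),
        )

    def spawn_secvm(self, vcpus: int, memory_mb: int) -> Resources:
        used = self._in_use()
        # On a real host, cgroups would enforce this bound; here we just check it.
        if (used.vcpus + vcpus > self.allocated.vcpus or
                used.memory_mb + memory_mb > self.allocated.memory_mb):
            raise ValueError("SecVM request exceeds the PriVM's allocation")
        secvm = Resources(vcpus=vcpus, memory_mb=memory_mb)
        self.secvms.append(secvm)             # conceptually: hot-unplug from PriVM
        return secvm

pri = PriVM(Resources(vcpus=8, memory_mb=16384))
pri.spawn_secvm(vcpus=2, memory_mb=4096)      # fits in the carve-out, succeeds
```

The point of the sketch is the bound itself: no sequence of spawn requests can push the PriVM+SecVM group past the PriVM's original allocation, which is exactly what the cgroups layer guarantees on the host.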
Our approach is to enable this concept with as few changes as possible to the open-source virtualization ecosystem. We believe we can isolate most of our changes to libvirt or a similar layer, by adding a special communication channel based on vsock that allows the PriVM to issue requests to libvirt with limited privileges. The rest of the technologies we need are all standard in Linux/KVM-based hypervisors.
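To make the channel concrete, here is a minimal sketch of what the PriVM side of such a request could look like. The message format, port number, and function names are all assumptions for illustration; the actual SecVM API has not been published. Only the vsock mechanism itself (`AF_VSOCK`, host CID 2) is standard Linux.

```python
# Hypothetical sketch of the PriVM-side request channel described above.
# The JSON wire format and the port are illustrative assumptions.
import json
import socket

HOST_CID = 2        # VMADDR_CID_HOST: the hypervisor side of a vsock connection
SECVM_PORT = 9999   # assumed port where the libvirt-side agent would listen

def build_spawn_request(vcpus: int, memory_mb: int, disk: str) -> bytes:
    """Serialize a limited-privilege 'spawn SecVM' request."""
    return json.dumps({
        "op": "spawn-secvm",
        "vcpus": vcpus,        # must fit within the PriVM's own allocation
        "memory_mb": memory_mb,
        "disk": disk,          # a disk shared or moved from the PriVM
    }).encode()

def send_request(payload: bytes) -> None:
    # AF_VSOCK connects a guest to its host without any IP networking;
    # this call only works inside a Linux guest with the vsock driver loaded.
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as s:
        s.connect((HOST_CID, SECVM_PORT))
        s.sendall(payload)

req = build_spawn_request(vcpus=2, memory_mb=4096, disk="vdb")
```

A vsock channel fits the design goal well: it works with no guest network configuration, and the host-side agent can authenticate the caller by its context ID before honoring any request.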
This approach is similar to what Amazon has done with AWS Nitro Enclaves: Resources are carved-out of the main VM to execute a small, limited VM as an enclave. However, our approach is much more general and allows virtually any type of VM to be spawned as a SecVM, as long as it's only using resources pre-allocated to the PriVM. We aim to open source our API and allow any virtualization layer to easily incorporate this feature, and to allow cross-communication and portability of all services that want to take advantage of this framework.
We will be presenting more details of our SecVM framework at KVM Forum 2022. If you are interested in learning more about our work, please consider attending our talk, in person or remotely.