A Case for Using Cache Line Deltas for High-Frequency VM Snapshotting
Active-standby schemes for Virtual Machine (VM) high availability require periodic synchronization of memory and CPU state. The most common approach is to use page tables and software to identify “dirty” memory pages at the source and copy them to the target via a network or interconnect. However, this approach incurs significant page-table traversal and data-copying overhead, resulting in considerable VM downtime. A principal contributor to this overhead is data copy-amplification: because the processor’s virtual memory system tracks changes at page granularity (4KiB or larger), far more data is copied than has actually changed. With the emergence of CXL-enabled memory devices, it is now possible to track memory changes at a finer granularity (e.g., 64-byte cache lines instead of 4KiB pages). Moreover, CXL enables new functions to be pushed down into custom memory controllers that can directly intercept and manipulate memory transactions. This paper examines the potential advantages of moving to cache line-based memory change detection and transfer. We focus on exploring continuous synchronization of VM guest memory spaces for the purpose of achieving high availability; for this use case, the maximum outage time, resulting from snapshot and synchronization latency, must be kept to a minimum. Our analysis examines memory access patterns from 30 different benchmarks and derives a quantitative understanding of the potential gains CXL-based technology could offer. The results show that more than 35% of the benchmarks exhibit an amplification factor greater than 10 and would therefore benefit significantly from the proposed finer granularity. Furthermore, we show that combining fine-grained tracking with compression further reduces the data transferred between machines, cutting copy volume by half.
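As an illustration of the copy-amplification metric described above, the following sketch (a hypothetical example of ours, not the paper's measurement tooling) compares the bytes transferred under page-granularity versus cache-line-granularity dirty tracking for the same set of written addresses:

```python
# Illustrative sketch (hypothetical, not the paper's artifact):
# compare how much data page-level vs. cache-line-level dirty
# tracking would transfer for the same set of written addresses.

PAGE = 4096  # 4 KiB page
LINE = 64    # 64-byte cache line

def transfer_volumes(dirty_addrs):
    """Return (page_bytes, line_bytes) copied under each granularity."""
    pages = {a // PAGE for a in dirty_addrs}  # distinct dirty pages
    lines = {a // LINE for a in dirty_addrs}  # distinct dirty cache lines
    return len(pages) * PAGE, len(lines) * LINE

# Example: one small write in each of 10 distinct pages.
writes = [i * PAGE + 128 for i in range(10)]
page_bytes, line_bytes = transfer_volumes(writes)
print(page_bytes, line_bytes, page_bytes / line_bytes)  # 40960 640 64.0
```

A sparse write pattern like this one yields an amplification factor of 64 (one dirty line per dirty page); dense patterns that touch most lines of a page drive the factor toward 1, which is why the benefit reported in the paper varies across benchmarks.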