EW 2000
Conference paper

The SawMill multiserver approach

Abstract

In an initial experiment, we compare file system performance using the Iozone reread benchmark [1]. All data are from a 500 MHz Pentium III with 64 MB RAM. Iozone was run on three systems: (1) Linux 2.2.1; (2) L4Linux derived from Linux 2.2.1; and (3) SawMill Linux derived from Linux 2.2.1. For our analysis we focused on reread throughput, in which Iozone reads a file twice and measures the throughput of the second read. For our purposes, reread is a harder test than read because the reread operation is faster and thus the componentization overhead is more critical.

Figure 4 shows reread performance for various file sizes using a 4 KB record size. For the smaller file sizes (64 and 128 KB), rereads copy data from the L2 cache (file buffer) to the L1 cache (application buffer). Because the copy time is lowest in this case, the performance overhead of IPC is most significant. Beginning with file sizes of 256 KB, throughput decreases because the copies are from memory to the L1 cache. For each 4 KB record in the reread, we measured approximately 3000 cycles for bare Linux, 4200 cycles for L4Linux, and 3500 cycles for SawMill Linux.

Although the multiserver SawMill clearly performs better than the monolithic L4Linux on the microkernel, SawMill is still 500 cycles slower than bare Linux. We expected an overhead of about 200 cycles: the kernel overhead of the first IPC plus the full cost of the second (i.e., including the kernel sysenter/sysexit). To our surprise, an important part of the problem seems to be the generated stub code. We expected cheap stubs, since the Flick IDL compiler [8] generates C code whose operations are all inlined in the user program. However, the code actually generated by GCC uses about 150 machine instructions per client stub, mostly needless copies to and from local variables. For the current measurements, we did not hand-code stubs but only improved the Flick/GCC code through simple specialization. However, it is clear that an improved code-generation facility has to be developed, one that generates near-optimal code, which we found to be about 30 instructions.
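The stub-overhead discussion may be easier to follow with a concrete picture of what a near-minimal client stub can look like. The C sketch below is illustrative only: the IPC primitive `ipc_call()`, the message layout `ipc_msg_t`, the opcode `FS_OP_READ`, and the stub name `fs_read` are all hypothetical, not taken from SawMill, Flick, or the L4 API. The point it shows is that a stub can marshal its arguments directly into the message words, without the intermediate copies through local variables that the generated code introduced.

/* Minimal sketch of a hand-written client stub, under the assumptions
 * named above; real SawMill/Flick stubs and the L4 IPC interface differ. */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>          /* ssize_t */

/* Hypothetical IPC message: a few machine words. */
typedef struct {
    uintptr_t w[4];
} ipc_msg_t;

/* Hypothetical kernel IPC primitive: send a request, wait for the reply. */
extern int ipc_call(int server, ipc_msg_t *msg);

enum { FS_OP_READ = 3 };        /* illustrative opcode, not from the paper */

/* Client stub for a file-server read(): arguments go straight into the
 * message words, with no redundant local copies. */
static inline ssize_t fs_read(int server, int fd, void *buf, size_t len)
{
    ipc_msg_t m;
    m.w[0] = FS_OP_READ;
    m.w[1] = (uintptr_t)fd;
    m.w[2] = (uintptr_t)buf;    /* buffer assumed shared with the server */
    m.w[3] = len;
    if (ipc_call(server, &m) != 0)
        return -1;
    return (ssize_t)m.w[0];     /* reply word 0: bytes read, or error */
}

A stub of this shape compiles to a handful of stores plus the IPC invocation, roughly in the spirit of the ~30-instruction near-optimal code mentioned above; the ~150-instruction generated stubs spend the difference shuffling the same values through local variables.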