Publication
ISPASS 2007
Conference paper

Characterizing a complex J2EE workload: A comprehensive analysis and opportunities for optimizations

Abstract

While past studies of relatively simple Java benchmarks such as SPECjvm98 and SPECjbb2000 have been integral in advancing the server industry, this paper presents an analysis of a significantly more complex 3-tier J2EE (Java 2 Enterprise Edition) commercial workload, SPECjAppServer2004. Understanding the nature of such commercial workloads is critical for developing the next generation of servers and identifying promising directions for systems and software research. In this study, we validate or disprove several assumptions commonly made about Java workloads. For instance, on a tuned system with an appropriately sized heap, the fraction of CPU time spent on garbage collection for this complex workload is small (<2%) compared to commonly studied client-side Java benchmarks. Unlike small benchmarks, this workload has a rather "flat" method profile with no obvious hot spots. Therefore, new performance analysis techniques and tools to identify opportunities for optimization are needed, because the traditional 90/10 rule of thumb does not apply. We evaluate hardware performance monitor data and use the insights to motivate future research. We find that this workload has a relatively high CPI and branch misprediction rate. We observe that almost half of the executed instructions are loads and stores and that the data working set is large. There are very few cache-to-cache "modified data" transfers, which limits opportunities for intelligent thread co-scheduling. We note that while using large pages for the Java heap is a simple and effective way to reduce TLB misses and improve performance, there is room to reduce translation misses further by placing executable code into large pages. We use statistical correlation to quantify the relationship between various hardware events and overall system performance. We find that CPI is strongly correlated with branch mispredictions, translation misses, instruction cache misses, and bursty data cache misses that trigger data prefetching. We note that target-address mispredictions for indirect branches (corresponding to Java virtual method calls) are strongly correlated with instruction cache misses. Our observations can be used by hardware and runtime architects to estimate the potential benefits of performance enhancements under consideration. ©2007 IEEE.
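
The correlation analysis mentioned in the abstract can be illustrated with a small, self-contained sketch. The class name, helper method, and sample values below are hypothetical and are not taken from the paper; they only show how per-interval hardware counter samples (for example, CPI versus branch mispredictions per thousand instructions) might be correlated using the Pearson coefficient.

// Minimal sketch, assuming per-interval counter samples are already available.
// All values below are hypothetical placeholders, not measured data.
public final class CounterCorrelation {

    // Pearson correlation coefficient between two equal-length series.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX  += x[i];
            sumY  += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double cov  = n * sumXY - sumX * sumY;
        double varX = n * sumX2 - sumX * sumX;
        double varY = n * sumY2 - sumY * sumY;
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical per-interval samples derived from hardware counters.
        double[] cpi                = {1.9, 2.1, 2.4, 2.0, 2.6, 2.3};
        double[] branchMissPerKInst = {6.2, 7.0, 8.1, 6.5, 8.9, 7.6};

        System.out.printf("corr(CPI, branch misses/kinst) = %.3f%n",
                          pearson(cpi, branchMissPerKInst));
    }
}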
