The three major solutions for increasing the nominal performance of a CPU are: multiplying the number of cores per socket, expanding the embedded cache memories and use multi-threading to reduce the impact of the deep memory hierarchy. Systems with tens or hundreds of hardware threads, all sharing a cache coherent UMA or NUMA memory space, are today the de-facto standard. While these solutions can easily provide benefits in a multi-program environment, they require recoding of applications to leverage the available parallelism. Threads must synchronize and exchange data, and the overall performance is heavily influenced by the overhead added by these mechanisms, especially as developers try to exploit finer grain parallelism to be able to use all available resources. Copyright © 2012 ACM.