The future of accelerator programming: Abstraction, performance or can we have both?
Abstract
In a perfect world, code would be written only once and would run on different devices with high efficiency. To a degree, that used to be the case in the era of frequency scaling on a single core. However, due to power limitations, parallel programming has become necessary to obtain performance gains. But parallel architectures differ substantially from each other, often require specialized knowledge to exploit, and typically necessitate reimplementation and fine-tuning of programs. These slow tasks frequently lead to situations where most of the time is spent reimplementing old code rather than writing new code. The goal of our research is to find programming techniques that increase productivity, maintain high performance, and provide abstraction to free the programmer from these unnecessary and time-consuming tasks. However, such techniques usually come at the cost of substantial performance degradation. This paper investigates current approaches to portable accelerator programming, seeking to answer whether they make it possible to combine high efficiency with sufficient algorithm abstraction. It discusses OpenCL as a potential solution and presents three approaches to writing portable code: GPU-centric, CPU-centric, and combined. By applying the three approaches to a real-world program, we show that it is at least sometimes possible to run exactly the same code on many different devices with minimal performance degradation by using parameterization. The main contributions of this paper are an extensive review of the current state of the art and our original copious-parallelism technique for addressing the stated problem.
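As a rough illustration of the parameterization idea mentioned above (a minimal sketch, not the paper's actual code: the scale kernel and the VEC_WIDTH parameter are hypothetical), a single OpenCL source can be specialized per device class through build-time options passed to clBuildProgram:

```c
/* Sketch: one OpenCL kernel source, tuned per device via -D build options. */
#include <stdio.h>
#include <CL/cl.h>

static const char *kernel_src =
    "__kernel void scale(__global float *x, const float a) {\n"
    "    size_t i = get_global_id(0);\n"
    "    /* VEC_WIDTH is injected at build time, not hard-coded in the source. */\n"
    "    for (int k = 0; k < VEC_WIDTH; ++k)\n"
    "        x[i * VEC_WIDTH + k] *= a;\n"
    "}\n";

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_device_type type;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_TYPE, sizeof(type), &type, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);

    /* Same source text; only the tuning parameter differs per device class:
       many fine-grained work-items on a GPU, coarser per-item work on a CPU. */
    const char *opts = (type & CL_DEVICE_TYPE_GPU) ? "-DVEC_WIDTH=1"
                                                   : "-DVEC_WIDTH=8";

    if (clBuildProgram(prog, 1, &device, opts, NULL, NULL) != CL_SUCCESS) {
        fprintf(stderr, "build failed\n");
        return 1;
    }
    printf("built with options: %s\n", opts);

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}
```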