GPU architectures have emerged as a viable way of considerably improving performance for appropriate applications. Program fragments (kernels) appropriate for GPU execution can be implemented in CUDA or OpenCL and glued into an application via an API. While there is plenty of evidence of performance improvements using this approach, there are many issues with productivity. Programmers must understand an additional programming model and API to program the accelerator; concurrency and synchronization in this programming model is typically expressed differently from the programming model for the host. On top of this, the languages used to write kernels are very low level and thus prone to the kinds of errors that one does not encounter in higher level languages. Programmers must explicitly deal with moving data back-and-forth between the host and the accelerator. These problems are compounded when the user code must be run across a cluster of accelerated nodes. Now the host programming model must further be extended with constructs to deal with scale-out and remote accelerators. We believe there is a critical need for a single source programming model that can be used to write clean, efficient code for heterogeneous, multi-core and scale-out architectures. The APGAS programming model has been developed for such architectures over the past six years. APGAS is based on four fundamental (and architecture-independent) notions: locality, asynchrony, conditional atomicity and order. X10 is an instantiation of the APGAS programming model on top of a base sequential language with Java-style productivity. Earlier work has shown that X10 can be used to write clean and efficient code for homogeneous multi-cores, SMPs, Cell-accelerated nodes, and clusters of such nodes. In this paper we show how X10 programmers can write code that can be compiled and run on GPUs. GPU programming idioms such as threads, blocks, barriers, constant memory, local registers, shared memory variables, etc. can be directly expressed in X10, and do not require new language extensions. We present the design of an extension of the X10-to-C++ compiler which recognizes such idioms and produces CUDA kernel code. We show several benchmarks written in this style. The performance of these kernels is within 80% of hand-written CUDA kernels. We believe these results establish X10 as a single-source programming language in which clean, efficient programs can be written for GPU-accelerated clusters. © 2011 ACM.