OpenCL is an open standard for heterogeneous parallel programming, exploiting multi-core CPUs, GPUs, or other accelerators as parallel computing resources. Recent work has extended the OpenCL parallel programming model for distributed heterogeneous clusters. For such loosely coupled acceleration architectures, the design of OpenCL programs to maximize performance is quite different from that of conventional tightly coupled acceleration platforms. This paper describes our experiences in OpenCL programming to extract scalable performance for a distributed heterogeneous cluster environment. We picked two real-world analytics workloads, Two-Step Cluster and Linear Regression, that offer different challenges to efficient OpenCL implementations. We obtained scalable performance with this architecture by carefully managing the amount of data and computations in the kernel program design and by well addressing the network latency problems through optimizations. Copyright 2013 ACM.