Improving the performance of work-stealing load-balancing algorithms in distributed shared-memory systems is challenging. These algorithms need to overcome the high costs of contention among workers, of communication and remote data references between nodes, and of their impact on the locality preferences of tasks. Prior research focuses on stealing from a victim that best exploits data locality, and on using special deques that minimize the contention between local and remote workers. This work explores the selection of tasks that are favourable for migration across nodes in a distributed-memory cluster, a lesser-explored dimension of distributed work-stealing. The selection of tasks is guided by application-level task locality rather than by hardware memory topology, as is the norm in the literature. The prototype for the performance evaluation of these ideas is implemented in X10, a realization of the asynchronous partitioned global address space programming model. The evaluation demonstrates the applicability of this new approach to several real-world applications chosen from the Cowichan and Lonestar suites. On a cluster of 128 processors, the new work-stealing strategy demonstrates a speedup between 12% and 31% over X10's existing scheduler. Moreover, the new strategy does not degrade the performance of any of the applications studied. © 2013 IEEE.