Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. Prior work has demonstrated that work-stealing is very effective in practice. However, workstealing comes with a substantial overhead: as much as 2- to 12- slowdown over orthodox sequential code. In this paper we identify the key sources of overhead in work-stealing schedulers and present two significant refinements to their implementation. We evaluate our workstealing designs using a range of benchmarks, four different work-stealing implementations, including the popular fork-join framework, and a range of architectures. On these benchmarks, compared to orthodox sequential Java, our fastest design has an overhead of just 15%. By contrast, fork-join has a 2.3- overhead and the previous implementation of the system we use has an overhead of 4.1-. These results and our insight into the sources of overhead for workstealing implementations give further hope to an already promising technique for exploiting increasingly available hardware parallelism.