About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
DPDS 1990
Conference paper
An effective algorithm for parallelizing sort merge joins in the presence of data skew
Abstract
A parallel sort-merge-join algorithm that uses a divide-and-conquer approach to address the data skew problem is proposed. The algorithm adds an extra scheduling phase to the usual sort, transfer, and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the largest skew elements and assigns each of them to an optimal number of processors. Assuming a Zipf-like distribution for data skew, the algorithm is shown to achieve very good load balancing for the join phase in a CPU-bound environment and to be very robust relative to the degree of data skew and the total number of processors.