In a generalized shuffle permutation an address $(a_{q-1} a_{q-2} \ldots a_0)$ receives its content from an address obtained through a cyclic shift on a subset of the $q$ dimensions used for the encoding of the addresses. Bit-complementation may be combined with the shift. We give an algorithm that requires $K/2 + 2$ exchanges for $K$ elements per processor when storage dimensions are part of the permutation and concurrent communication on all ports of every processor is possible. The number of element exchanges in sequence is independent of the number of processor dimensions $\sigma_r$ in the permutation. With no storage dimensions in the permutation, our best algorithm requires $(\sigma_r + 1)\lceil K/(2\sigma_r) \rceil$ element exchanges. We also give an algorithm for the case where $\sigma_r = 2$, or where the real shuffle consists of a number of cycles of length two, that requires $K/2 + 1$ element exchanges in sequence when there is no bit-complementation. The lower bound is $K/2$ for both real and mixed shuffles with no bit-complementation. The minimum number of communication start-ups is $\sigma_r$ for both cases, which is also the lower bound. The data transfer time for communication restricted to one port per processor is $\sigma_r (K/2)$, and the minimum number of start-ups is $\sigma_r$. The analysis is verified by experimental results on the Intel iPSC/1, and for one case also on the Connection Machine model CM-2. © 1992.
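The address mapping in the first two sentences can be illustrated with a minimal sequential sketch. It is not the communication algorithm of the paper; it only computes which source address feeds each destination address. The function names `shuffle_source` and `apply_shuffle`, the parameter `complement_mask`, and the choice of shift direction (each participating bit moves to the next listed dimension, wrapping around) are assumptions made for illustration.

```python
def shuffle_source(addr, dims, complement_mask=0):
    """Return the address whose content is moved to `addr`.

    dims            -- bit positions taking part in the cyclic shift
                       (assumed listed in the order they are cycled)
    complement_mask -- bits to complement after the shift (assumed encoding)
    """
    src = addr
    n = len(dims)
    for i, d in enumerate(dims):
        bit = (addr >> d) & 1          # bit of the destination address
        target = dims[(i + 1) % n]     # next dimension in the cycle
        # Clear the target position, then write the shifted bit there.
        src = (src & ~(1 << target)) | (bit << target)
    # Bits outside `dims` are untouched; complementation is an XOR.
    return src ^ complement_mask


def apply_shuffle(data, dims, complement_mask=0):
    """Apply the permutation to a list whose length is a power of two."""
    return [data[shuffle_source(a, dims, complement_mask)]
            for a in range(len(data))]


if __name__ == "__main__":
    # Full shuffle on q = 3 address bits.
    print(apply_shuffle(list(range(8)), dims=[0, 1, 2]))
    # Shift on the subset {0, 2} combined with complementing bit 1.
    print(apply_shuffle(list(range(8)), dims=[0, 2], complement_mask=0b010))
```

Restricting `dims` to a subset of the bit positions gives the generalized case; under this sketch's convention, a two-element `dims` is a cycle of length two, that is, a swap of two address bits.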