Continuous collision detection (CCD) interpolates the trajectories of polygons to detect collisions between successive time steps. However, primitive-level CCD is very time-consuming, especially for a large number of moving polygons. Over the years, a number of approaches have been proposed to improve the computational efficiency of CCD by culling non-colliding primitives before exact overlap tests. These approaches have two fundamental disadvantages. First, they are mainly designed for self- and pairwise CCD, so their performance gain is limited when they are applied to large-scale scenes that contain thousands of moving polygons. Second, they are designed as sequential processes suited to execution on a single processor, so deploying them on high-performance parallel computing systems would not increase their computational efficiency. In this paper, we present a parallel CCD algorithm that aims to accelerate N-body CCD culling by distributing the load across a high-performance GPU cluster. Our implementation integrates frameworks such as the Message Passing Interface (MPI) and CUDA, making it particularly suitable for large-scale distributed simulations. Experimental results, based on simulations conducted on a supercomputer, demonstrate that our approach is more computationally efficient than existing sequential CCD approaches.