About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
CGO 2004
Conference paper
Custom data layout for memory parallelism
Abstract
In this paper, we describe a generalized approach to deriving a custom data layout in multiple memory banks for array-based computations, to facilitate high-bandwidth parallel memory accesses in modem architectures where multiple memory banks can simultaneously feed one or more functional units. We do not use a fixed data layout, but rather select application-specific layouts according to access patterns in the code. A unique feature of this approach is its flexibility in the presence of code reordering transformations, such as the loop nest transformations commonly applied to array-based computations. We have implemented this algorithm in the DEFACTO system, a design environment for automatically mapping C programs to hardware implementations for FPGA-based systems. We present experimental results for five multimedia kernels that demonstrate the benefits of this approach. Our results show that custom data layout yields results as good as, or better than, naive or fixed cyclic layouts, and is significantly better for certain access patterns and in the presence of code reordering transformations. When used in conjunction with unrolling loops in a nest to expose instruction-level parallelism, we observe greater than a 75% reduction in the number of memory access cycles and speedups ranging from 3.96 to 46.7 for 8 memories, as compared to using a single memory with no unrolling.