# The uniform memory hierarchy model of computation

## Abstract

The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performance-relevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies. A sequential computer's memory is modeled as a sequence 〈M 0, M 1,...〉 of increasingly large memory modules. Computation takes place in M 0. Thus, M 0 might model a computer's central processor, while M 1 might be cache memory, M 2 main memory, and so on. For each module M u, a bus B u connects it with the next larger module Mu+1. All buses may be active simultaneously. Data is transferred along a bus in fixed-sized blocks. The size of these blocks, the time required to transfer a block, and the number of blocks that fit in a module are larger for modules farther from the processor. The UMH model is parametrized by the rate at which the blocksizes increase and by the ratio of the blockcount to the blocksize. A third parameter, the transfer-cost (inverse bandwidth) function, determines the time to transfer blocks at the different levels of the hierarchy. UMH analysis refines traditional methods of algorithm analysis by including the cost of data movement throughout the memory hierarchy. The communication efficiency of a program is a ratio measuring the portion of UMH running time during which M0 is active. An algorithm that can be implemented by a program whose communication efficiency is nonzero in the limit is said to be communication- efficient. The communication efficiency of a program depends on the parameters of the UMH model, most importantly on the transfer-cost function. A threshold function separates those transfer-cost functions for which an algorithm is communication-efficient from those that are too costly. Threshold functions for matrix transpose, standard matrix multiplication, and Fast Fourier Transform algorithms are established by exhibiting communication-efficient programs at the threshold and showing that more expensive transfer-cost functions are too costly. A parallel computer can be modeled as a tree of memory modules with computation occurring at the leaves. Threshold functions are established for multiplication of N×N matrices using up to N2 processors in a tree with constant branching factor. © 1994 Springer-Verlag New York Inc.