# Efficiently Extendible Mappings for Balanced Data Distribution

## Abstract

In data storage applications, a large collection of consecutively numbered data "buckets" are often mapped to a relatively small collection of consecutively numbered storage "bins." For example, in parallel database applications, buckets correspond to hash buckets of data and bins correspond to database nodes. In disk array applications, buckets correspond to logical tracks and bins correspond to physical disks in an array. Measures of the "goodness" of a mapping method include: (1) The time (number of operations) needed to compute the mapping. (2) The storage needed to store a representation of the mapping. (3) The balance of the mapping, i.e., the extent to which all bins receive the same number of buckets. (4) The cost of relocation, that is, the number of buckets that must be relocated to a new bin if a new mapping is needed due to an expansion of the number of bins or the number of buckets. One contribution of this paper is to give a new mapping method, the Interval-Round-Robin (IRR) method. The IRR method has optimal balance and relocation cost, and its time complexity and storage requirements compare favorably with known methods. Specifically, if m is the number of times that the number of bins and/or buckets has increased, then the time complexity is O(log m) and the storage is O(m2). Another contribution of the paper is to identify the concept of a history-independent mapping, meaning informally that the mapping does not "remember" the past history of expansions to the number of buckets and bins, but only the current number of buckets and bins. Thus, such mappings require very little information to be stored. Assuming that balance and relocation are optimal, we prove that history-independent mappings are possible if the number of buckets is fixed (so only the number of bins can increase), but not possible if the number of bins and buckets can both increase.