Load balance strategy of data routing algorithm using semantics for deduplication clusters
Abstract
The backup requirements of data centres are enormous because the volume of data created by humans is massive and growing exponentially. Single-node deduplication cannot meet this growing demand. A feasible alternative is the deduplication cluster, which scales by adding storage nodes. The data routing strategy is the key component of a deduplication cluster. DRSS (data routing strategy using semantics) substantially improves on the storage utilization of the MCS (minimum chunk signature) data routing strategy. However, in large deduplication clusters, the load balance of DRSS is worse than that of MCS. To improve it, we propose a load balance strategy for DRSS, namely DRSSLB. When a node is overloaded, DRSSLB iteratively migrates the node's current smallest container to the smallest node in the deduplication cluster until the node is no longer overloaded. A container is the minimum unit of data migration, and similar files that share the same features or file names are stored in the same container. This ensures that groups of similar data remain on the same node after rebalancing. We evaluate DRSSLB on a real-world dataset. Experimental results show that, for various numbers of nodes in the deduplication cluster, the data skew of DRSSLB stays below the predefined value while its storage utilization barely increases compared with DRSS, at a low cost (the data migration rate is only 6.5% when the number of nodes is 64).
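The migration loop described above can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `rebalance`, the `max_skew` parameter, and the overload test (a node is treated as overloaded when its size exceeds `max_skew` times the mean node size) are assumptions made for illustration, since the abstract does not define how overload or data skew is measured. Containers are modelled only by their sizes, and each container is moved whole, so the similar files it holds stay together.

```python
from typing import Dict, List


def rebalance(nodes: Dict[str, List[int]], max_skew: float = 1.2) -> int:
    """Move smallest containers off overloaded nodes; return the migrated size.

    `nodes` maps a node name to the sizes of the containers it stores and is
    modified in place. The overload rule (size > max_skew * mean node size)
    is an assumption of this sketch, not a definition from the paper.
    """
    if len(nodes) < 2:
        return 0  # nowhere to migrate to

    total = sum(sum(sizes) for sizes in nodes.values())
    limit = max_skew * total / len(nodes)  # assumed overload threshold
    migrated = 0

    # Single pass over the nodes; a real system would re-check continuously.
    for name in list(nodes):
        containers = nodes[name]
        # Keep at least one container so the loop always terminates.
        while sum(containers) > limit and len(containers) > 1:
            # The smallest node in the cluster, excluding the overloaded one.
            target = min((n for n in nodes if n != name),
                         key=lambda n: sum(nodes[n]))
            # The container is the minimum unit of migration: move it whole.
            smallest = min(containers)
            containers.remove(smallest)
            nodes[target].append(smallest)
            migrated += smallest
    return migrated
```

Dividing the returned value by the total stored size gives a migration rate for one run of the sketch, which is the kind of penalty the abstract reports. Because the sketch makes only one pass, a node that receives containers late in the pass may need a further call to `rebalance`.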