A long-term user-centric analysis of deduplication patterns
Abstract
Deduplication has become essential in disk-based backup systems, but there have been few long-term studies of backup workloads. Most past studies either examined a small static snapshot or covered only a short period that was not representative of how a backup system evolves over time. For this paper, we collected 21 months of data from a shared user file system, covering 33 users and over 4,000 snapshots. We analyzed the data set for a variety of essential characteristics, but our primary focus was individual user data. Although all of our users had apparently similar roles and behavior, we found significant differences in their deduplication ratios. Moreover, the data that some users share with others had a much higher deduplication ratio than average. We analyze this behavior and make recommendations for the design of future deduplication systems.