A data deduplication ratio is the measurement of data's original size versus the data's size after removing redundancy.
Data deduplication is the process that removes redundant data before a data backup. The data deduplication ratio measures the effectiveness of the dedupe process. It is calculated by dividing the total capacity of backed-up data before removing duplicates by the actual capacity used after the backup is complete. For example, a 5:1 data deduplication ratio means that the amount of data protected is five times the physical space required to store it.
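The calculation described above can be sketched in a few lines of Python. The function name `dedup_ratio` and the capacity figures are hypothetical, chosen only to reproduce the 5:1 example; they do not come from any particular backup product.

```python
def dedup_ratio(logical_bytes: int, physical_bytes: int) -> float:
    """Return the deduplication ratio as a single number (5.0 means 5:1).

    logical_bytes:  capacity of the backed-up data before removing duplicates
    physical_bytes: capacity actually used after the backup is complete
    """
    if physical_bytes <= 0:
        raise ValueError("physical_bytes must be positive")
    return logical_bytes / physical_bytes


# Hypothetical figures: 50,000 GB protected, 10,000 GB actually stored.
ratio = dedup_ratio(50_000, 10_000)
print(f"{ratio:g}:1")  # prints 5:1
```

Both arguments must use the same unit; the ratio itself is unitless.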
When a vendor states it can achieve a certain deduplication ratio, that number indicates a best-case scenario. Because deduplication works by removing redundant data, if no redundant data exists, then deduplication is impossible. Some types of data, such as MPEG videos or JPEG images, are already compressed and contain little redundancy.
As the deduplication ratio increases, the dedupe process yields diminishing returns. A 100:1 ratio eliminates 99% of the data. Increasing the ratio to 500:1 eliminates 99.8% of the data, a gain of only 0.8 percentage points, because most of the redundancy has already been removed.
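The diminishing returns follow from the relationship between a ratio of N:1 and the fraction of data eliminated, which is 1 - 1/N. A minimal sketch, using a hypothetical helper named `space_savings`:

```python
def space_savings(ratio: float) -> float:
    """Fraction of the original data eliminated at a given N:1 ratio."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1 (1:1 means no reduction)")
    return 1 - 1 / ratio


# Each step up in ratio removes a smaller additional share of the data.
for r in (2, 5, 10, 20, 100, 500):
    print(f"{r}:1 eliminates {space_savings(r):.1%} of the data")
```

Moving from 2:1 to 5:1 raises the savings from 50% to 80%, while moving from 100:1 to 500:1 raises it only from 99% to 99.8%.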
Several factors affect data deduplication ratios, including:
- Data retention -- The longer data is retained, the greater the probability of finding redundancy
- Data type -- Some data contains more redundancy than other data. For example, an environment with primarily Windows servers and similar files will likely produce a higher ratio than one with more varied data
- Change rate -- High data change rates often yield low deduplication ratios
- Location -- The wider the scope, the higher the likelihood of finding duplicates. Global deduplication, which compares data across multiple systems, will usually generate more reduction than data deduped locally on one device
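The effect of scope in the Location factor can be illustrated with a toy comparison. This is a sketch under stated assumptions, not how any specific product works: the two servers, their 4 KB blocks, and the helper `unique_bytes` are all hypothetical, and SHA-256 fingerprints stand in for whatever block identification a real system uses.

```python
import hashlib


def unique_bytes(chunks: list[bytes]) -> int:
    """Bytes needed to store each distinct chunk once, keyed by SHA-256."""
    seen = {}
    for chunk in chunks:
        seen[hashlib.sha256(chunk).digest()] = len(chunk)
    return sum(seen.values())


# Hypothetical data: both servers share the same three OS blocks,
# and each holds one private block that appears twice.
os_image = [b"A" * 4096, b"B" * 4096, b"C" * 4096]
server1 = os_image + [b"1" * 4096] * 2
server2 = os_image + [b"2" * 4096] * 2

logical = sum(len(c) for c in server1 + server2)

# Local dedupe: each device only finds duplicates within its own data.
local_ratio = logical / (unique_bytes(server1) + unique_bytes(server2))

# Global dedupe: duplicates are found across both systems.
global_ratio = logical / unique_bytes(server1 + server2)

print(f"local:  {local_ratio:.2f}:1")   # prints local:  1.25:1
print(f"global: {global_ratio:.2f}:1")  # prints global: 2.00:1
```

The shared OS blocks are stored twice under local deduplication and once under global deduplication, which accounts for the difference.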
Variable chunking is effective at recognizing duplicate data and increases the data deduplication ratio because it can match smaller pieces of data more easily and quickly. By contrast, legacy data protection software that dedupes data only within a single data stream produces lower dedupe ratios, which increases the cost of data protection.
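To illustrate why variable chunking tends to find more duplicates, the sketch below compares it with fixed-size chunking after a single byte is inserted at the start of a stream. It is a simplified illustration, not a production algorithm: the function names, window size, and mask are assumptions, and `zlib.crc32` recomputed over a sliding window stands in for the rolling hashes (such as Rabin fingerprints) that real implementations typically use.

```python
import random
import zlib


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Cut the stream at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0xFF) -> list[bytes]:
    """Cut wherever a hash of the preceding `window` bytes matches a pattern,
    so boundaries depend on content rather than on position."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def reused(old: list[bytes], new: list[bytes]) -> int:
    """Count chunks in `new` that were already stored from `old`."""
    known = set(old)
    return sum(1 for chunk in new if chunk in known)


rng = random.Random(0)  # fixed seed so the run is repeatable
original = bytes(rng.randrange(256) for _ in range(8192))
modified = b"X" + original  # one byte inserted at the front

for name, chunker in (("fixed", fixed_chunks), ("variable", variable_chunks)):
    old, new = chunker(original), chunker(modified)
    print(f"{name}: {reused(old, new)} of {len(new)} chunks already stored")
```

With fixed-size chunks, the inserted byte shifts every boundary, so none of the new chunks match stored ones. With content-defined boundaries, only the chunk containing the insertion changes and the rest of the stream lines up again.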