Deduplication helps organizations use backup media as efficiently as possible by eliminating data redundancy. Because it is so effective, all the major backup vendors offer some sort of deduplication mechanism.
Most backup-oriented deduplication mechanisms today are based on the concept of block-level deduplication, which eliminates redundant storage blocks. Each vendor has its own way of doing things, but generally speaking, the deduplication engine examines each storage block and uses a mathematical formula to calculate a hash for each block. These hashes are stored in a hash table. When a new storage block needs to be written to disk, the data is hashed, and the resulting hash is compared against the hash table to determine whether the block is redundant or unique. If the block is unique, it is written to disk and its hash is added to the hash table. If the block is redundant, a pointer is updated to reference the pre-existing storage block.
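The hash-and-compare flow described above can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation; the class and method names are invented for the example, and SHA-256 stands in for whatever hash function a given product actually uses.

```python
import hashlib

class BlockDedupStore:
    """Toy block-level deduplication store (names are illustrative)."""

    def __init__(self):
        self.hash_table = {}   # hash -> index of the stored unique block
        self.blocks = []       # unique blocks actually written "to disk"
        self.pointers = []     # logical block stream, as indices into self.blocks

    def write_block(self, block: bytes) -> None:
        digest = hashlib.sha256(block).hexdigest()
        index = self.hash_table.get(digest)
        if index is None:
            # Unique block: write it and add its hash to the table.
            index = len(self.blocks)
            self.blocks.append(block)
            self.hash_table[digest] = index
        # Redundant block: only a pointer to the existing block is recorded.
        self.pointers.append(index)

    def read_all(self) -> bytes:
        return b"".join(self.blocks[i] for i in self.pointers)

store = BlockDedupStore()
for block in [b"A" * 4096, b"B" * 4096, b"A" * 4096]:
    store.write_block(block)

print(len(store.blocks))    # 2 unique blocks stored
print(len(store.pointers))  # 3 logical blocks referenced
```

Writing three logical blocks, two of which are identical, stores only two unique blocks; the third write updates a pointer instead of consuming space.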
Some vendors use a more efficient hash table lookup algorithm than others. Similarly, vendors differ with regard to the methods they use to avoid mistaking unique data for redundant data. In spite of these differences, block-level deduplication mechanisms tend to suffer from scalability problems.
Concerns with block-level deduplication
The entire dedupe process is based around analyzing and manipulating individual storage blocks. These storage blocks can vary in size, but often range from 4 KB to approximately 10 KB. Because each block is so small, a huge number of blocks can exist on even a modestly sized disk. The actual number of blocks varies based on factors such as block size, disk format and the amount of overhead required. To give you a rough idea of how many blocks could potentially exist, this conversion calculator estimates there are 2,097,152 blocks per gigabyte. The site doesn't specify the file system or the block size used, although that figure corresponds to 512-byte blocks (2^30 / 512 = 2,097,152). If this estimate is correct, it would mean there are roughly 2.1 billion blocks for each terabyte (TB) of storage.
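The arithmetic behind those counts is straightforward to check; the calculation below assumes binary gigabytes (2^30 bytes) and shows how the per-gigabyte figure depends on block size.

```python
# Rough block-count arithmetic (illustrative; real counts also depend on
# file system format and metadata overhead).
GiB = 2**30

for block_size in (512, 4096):
    blocks_per_gib = GiB // block_size
    print(f"{block_size}-byte blocks: {blocks_per_gib:,} per GiB")
# 512-byte blocks give 2,097,152 per GiB -- the calculator's figure.
# At 1,024 GiB per TiB, that is 2,147,483,648 (~2.1 billion) blocks per TiB.
print(f"{(GiB // 512) * 1024:,}")
```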
Block-level deduplication uses a hash table to track every storage block. Even though the hash table is far smaller than the data itself, there eventually comes a point where it becomes unwieldy. ExaGrid estimates that a hash table must track approximately a billion storage blocks for every 10 TB of data referenced.
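The billion-entry figure follows directly from the block sizes mentioned earlier. The sketch below uses ~10 KB blocks (the upper end of the range above); the 40-byte per-entry cost is an assumption for illustration (a 32-byte SHA-256 digest plus a pointer), not a vendor-published number.

```python
TB = 10**12

data = 10 * TB
block_size = 10 * 1024  # ~10 KB blocks, the upper end of the range above
entries = data // block_size
print(f"{entries:,} hash entries")  # 976,562,500 -- roughly a billion

# Assumed 40 bytes per entry (32-byte SHA-256 digest + 8-byte pointer):
# the table alone approaches 40 GB for 10 TB of tracked data.
table_bytes = entries * 40
print(f"{table_bytes / 10**9:.0f} GB")
```

At that scale the table no longer fits comfortably in memory, which is why the front-end controllers described next become necessary.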
Enterprise-class organizations have dealt with this problem by using a front-end controller. This is essentially a dedicated appliance that uses its own CPU, memory and storage to manage the deduplication process.
Although the requirement for such an appliance probably will not stop an organization from using deduplication, it does force the IT staff to consider how much data may eventually need to be deduplicated. Even a dedicated appliance can eventually be outgrown. When this happens, the organization will have to engage in the costly and difficult process of upgrading to a larger controller.
Advantages of zone-level deduplication
Zone-level deduplication, which is proprietary to ExaGrid, has two main advantages over block-level deduplication:
- It examines chunks of data that are significantly larger than blocks. Performing deduplication at a less granular level effectively reduces the size of the hash table. ExaGrid estimates the hash table size to be 1,000 times smaller than it would be if block-level deduplication were used.
- It supports scale-out architectures. If an organization begins to outgrow its front-end controller, it can simply add another controller rather than trying to upgrade to a larger controller. Subsequent controllers can be added on an as-needed basis.
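The effect of the first advantage on hash table size is easy to quantify. The figures below are illustrative: the example assumes a zone is roughly 1,000 times the size of a block, which is consistent with ExaGrid's estimate of a 1,000-fold smaller hash table, though the vendor does not publish an exact zone size here.

```python
# Comparing hash-table entry counts (illustrative figures only).
TB = 10**12
data = 100 * TB

block_size = 10 * 1024          # ~10 KB block, as in the earlier examples
zone_size = block_size * 1000   # assumed: zones ~1,000x larger than blocks

block_entries = data // block_size
zone_entries = data // zone_size
print(f"block-level: {block_entries:,} entries")
print(f"zone-level:  {zone_entries:,} entries")
print(block_entries // zone_entries)  # 1000-fold reduction
```

Fewer entries mean faster lookups and a table that stays in memory far longer as the data set grows.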
Zone-level deduplication effectively solves the scalability problems encountered with block-level deduplication. Although this is a proprietary technology, it is designed to be vendor-agnostic to the point that it will work with any backup application.