
How to ensure scaling and reliability in data deduplication systems

A data deduplication system needs to be able to scale so that your files are safe and available at all times.

When implementing a data deduplication system, it's important to consider scalability. Performance should remain acceptable as storage capacity and deduplication granularity increase. The system should also guard against data loss caused by errors in the deduplication algorithm, such as hash collisions.

Scaling and hash collisions

It is critical that data deduplication products reliably detect duplicate data elements -- that is, determine that one file, block or byte is identical to another. They do this by processing every data element through a mathematical "hashing" algorithm to create a unique identifier called a hash number. Each hash is then compiled into a list, often dubbed the hash index.

When the system processes new data elements, their resulting hash numbers are compared against the hash numbers already in the index. If a new data element produces a hash number identical to an entry already in the index, the new data is considered a duplicate and is not saved to disk -- only a small reference "stub" is stored, pointing back to the identical data already on disk. If the new hash number is not already in the index, the data element is considered new and stored to disk normally.
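The index lookup described above can be sketched in a few lines of Python. This is an illustrative toy, not any vendor's implementation: fixed-size chunks stand in for data elements, and a dictionary stands in for the hash index.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed-size chunking

def deduplicate(data: bytes):
    """Split data into chunks; store each unique chunk once."""
    index = {}   # hash index: digest -> stored chunk
    refs = []    # per-chunk reference "stubs" in file order
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in index:
            index[digest] = chunk   # new data: save to disk (here, the dict)
        refs.append(digest)         # duplicate or not, keep only a stub
    return index, refs

def reassemble(index, refs):
    """Rebuild the original data by following the stubs."""
    return b"".join(index[d] for d in refs)
```

With highly redundant input, the index holds far fewer bytes than the original data while `reassemble` still restores it exactly.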

A data element can produce a hash result identical to that of a stored element even though the underlying data is not identical. Such a false positive, also called a hash collision, can lead to data loss: the new data is discarded as a "duplicate" when it is actually unique. There are two ways to mitigate false positives.

  • The data deduplication vendor may opt to use more than one hashing algorithm on each data element. For example, the Single Instance Repository (SIR) on FalconStor Software Corp.'s virtual tape libraries (VTL) uses out-of-band indexing with SHA-1 and MD5 algorithms. This dramatically reduces the potential for false positives.
  • Another option is to use a single hashing algorithm but perform a bit-level comparison of data elements that register as identical.
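The second option above can be sketched as follows. On an index hit, the chunk is compared byte for byte against the stored copy before being discarded as a duplicate; the extra comparison is exactly the added processing cost discussed below. The function names here are illustrative.

```python
import hashlib

def is_duplicate(chunk: bytes, index: dict) -> bool:
    """Check chunk against the hash index, verifying any hit byte-for-byte."""
    digest = hashlib.sha1(chunk).digest()
    stored = index.get(digest)
    if stored is None:
        index[digest] = chunk   # new data element: store it
        return False
    # Hash matched -- full comparison rules out a collision.
    if stored == chunk:
        return True             # genuine duplicate: store only a stub
    # Genuine collision: same hash, different data. Store it under a
    # secondary key (here, a second hash) rather than losing the data.
    index[(digest, hashlib.md5(chunk).digest())] = chunk
    return False
```

The fallback branch also illustrates the first option: a second, independent hash (MD5 alongside SHA-1) makes it astronomically unlikely that two distinct chunks agree on both digests.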

The problem with both approaches is that they require more processing power from the host system, reducing index performance and slowing the deduplication process. As deduplication becomes more granular and examines smaller chunks of data, the index grows much larger, the probability of collisions rises, and any performance hit is exacerbated.
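The relationship between granularity and collision risk follows the birthday bound: among n chunks hashed to b bits, the probability of at least one collision is roughly n²/2^(b+1). A quick back-of-the-envelope calculation shows why smaller chunks (more chunks per terabyte) push that probability up, even though it remains tiny for a 160-bit hash like SHA-1.

```python
def collision_probability(n_chunks: int, hash_bits: int) -> float:
    """Birthday-bound approximation: P(collision) ~ n^2 / 2^(b+1)."""
    return (n_chunks ** 2) / float(2 ** (hash_bits + 1))

# 1 PB of data at 8 KB granularity -> 2**37 (~1.4e11) chunks.
coarse = collision_probability(2 ** 37, 160)
# The same 1 PB at 1 KB granularity -> 2**40 chunks: 64x the risk.
fine = collision_probability(2 ** 40, 160)
```

Both figures are vanishingly small in absolute terms; the point is the quadratic growth -- halving the chunk size quadruples the collision probability while also quadrupling the index size (per the factor-of-two reduction squared).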

Scaling and encryption

Another issue is the relationship between deduplication, more traditional compression and encryption in a company's storage infrastructure. Ordinary compression removes redundancy from files, and encryption "scrambles" data so that it appears completely random and unreadable. Both compression and encryption play an important role in data storage, but eliminating redundancy in the data can impair the deduplication process. If encryption or traditional compression is required along with deduplication, the indexing and deduplication should be performed first.
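A toy demonstration makes the ordering point concrete. The "encryption" below is a stand-in (a random nonce plus a hash-derived XOR keystream), not real cryptography, but it shares the relevant property of real ciphers: identical plaintext chunks produce different ciphertexts, so a hash index built after encryption finds no duplicates.

```python
import hashlib
import os

def toy_encrypt(chunk: bytes, key: bytes) -> bytes:
    """Stand-in cipher: random 16-byte nonce + XOR keystream."""
    nonce = os.urandom(16)
    stream = hashlib.sha256(key + nonce).digest() * (len(chunk) // 32 + 1)
    return nonce + bytes(c ^ s for c, s in zip(chunk, stream))

key = b"secret"
chunk = b"identical data " * 10

# Before encryption, two copies of the chunk hash identically
# and are deduplicable. After encryption, the random nonce makes
# the ciphertexts differ, so the hash index sees two unique elements.
c1 = toy_encrypt(chunk, key)
c2 = toy_encrypt(chunk, key)
```

This is why the deduplication index must see the data before it is encrypted (or aggressively compressed): afterward, the redundancy the index depends on is gone.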

Check out the entire Data Deduplication Handbook.
