There are also concerns about data loss due to hash collisions. How have manufacturers addressed that issue today?
I've spoken with a number of vendors about this, and for the most part they say the problem is largely resolved by the newest hashing algorithms. Permabit, for example, told me that the only way to be absolutely sure a new chunk of data is unique is to compare it against every other chunk already stored. It also said that's not practical, because doing so creates a huge time delay. So the compromise is to use a hashing algorithm.
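To make that tradeoff concrete, here is a minimal sketch of hash-based deduplication in Python. It is not any vendor's implementation: the `ChunkStore` class, the `dedupe` helper, the fixed 4 KB chunk size and the choice of SHA-256 are all assumptions made for illustration. The optional `verify` flag shows the middle ground between trusting the hash outright and comparing every chunk against every other: bytes are compared only when two chunks produce the same digest.

```python
import hashlib


class ChunkStore:
    """Toy in-memory deduplicating store (hypothetical, for illustration only)."""

    def __init__(self, verify=True):
        self.chunks = {}      # digest (hex string) -> chunk bytes
        self.verify = verify  # compare bytes when a digest is already present

    def put(self, chunk: bytes) -> str:
        """Store a chunk once and return its digest as a reference."""
        digest = hashlib.sha256(chunk).hexdigest()
        existing = self.chunks.get(digest)
        if existing is None:
            # First time this digest is seen: keep the data.
            self.chunks[digest] = chunk
        elif self.verify and existing != chunk:
            # Same digest, different bytes: a genuine hash collision.
            raise ValueError("hash collision detected for digest " + digest)
        return digest

    def get(self, digest: str) -> bytes:
        return self.chunks[digest]


def dedupe(data: bytes, store: ChunkStore, chunk_size: int = 4096) -> list:
    """Split data into fixed-size chunks and return the list of chunk references."""
    return [
        store.put(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


if __name__ == "__main__":
    store = ChunkStore(verify=True)
    data = b"A" * 8192 + b"B" * 4096   # two identical chunks, one different
    refs = dedupe(data, store)
    print(len(refs), "references,", len(store.chunks), "unique chunks stored")
```

With `verify=True`, the byte comparison runs only against the single stored chunk that shares the digest, so the cost stays small compared with checking every stored chunk.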
Other companies take a different approach. ExaGrid, for example, cuts data into large segments and analyzes the contents of each segment to see how the segments relate to one another. It then performs byte-level differencing on each segment and stores the data that way.
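The general idea of byte-level differencing can be sketched as follows. This is not ExaGrid's algorithm, which is not described here. It is a simplified stand-in that uses Python's standard `difflib.SequenceMatcher` to express a new segment as copies from a similar base segment plus the literal bytes that changed. The `make_delta` and `apply_delta` names are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, new: bytes) -> list:
    """Describe `new` as copy ranges from `base` plus literal byte runs."""
    ops = []
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # These bytes already exist in the base segment: store a reference.
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            # "replace" or "insert": store only the new bytes.
            ops.append(("data", new[j1:j2]))
        # "delete" needs no entry: those base bytes are simply not copied.
    return ops


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the new segment from the base segment and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


if __name__ == "__main__":
    base = b"backup of monday: alpha beta gamma"
    new = b"backup of tuesday: alpha beta delta gamma"
    delta = make_delta(base, new)
    literal = sum(len(op[1]) for op in delta if op[0] == "data")
    print("stored", literal, "literal bytes instead of", len(new))
    assert apply_delta(base, delta) == new
```

Only the literal runs and the copy ranges need to be stored for the new segment, which is where the space saving comes from when successive backups are mostly alike.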
Jerome M. Wendt is the founder and lead analyst of The Datacenter Infrastructure Group, an independent analyst and consulting firm that helps users evaluate the different storage technologies on the market and make the right storage decision for their organization.