I've had a chance to speak with a number of vendors about this, and for the most part they say the issue is largely resolved by the newest hashing algorithms. When I raised the subject with Permabit, the company said the only way to be absolutely sure that each chunk of data is unique is to compare every new chunk against every chunk already stored. It also noted that this really isn't practical, because doing so creates a huge time delay. So, the compromise is to use a hashing algorithm.
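To make the trade-off concrete, here's a minimal sketch of hash-based deduplication in Python. This is an illustration of the general technique, not Permabit's actual implementation: each chunk is identified by its SHA-256 digest, so a single hash-table lookup stands in for the impractical byte-for-byte comparison against every stored chunk.

```python
import hashlib

def dedup_store(chunks):
    """Store each unique chunk once, keyed by its SHA-256 digest."""
    store = {}   # digest -> chunk bytes
    saved = 0    # bytes avoided by deduplication
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in store:
            saved += len(chunk)    # duplicate: keep only a reference
        else:
            store[digest] = chunk  # first occurrence: keep the data
    return store, saved

store, saved = dedup_store([b"alpha", b"beta", b"alpha"])
# two unique chunks kept; the repeated 5-byte chunk is not stored twice
```

The lookup is constant-time per chunk, which is why hashing wins over exhaustive comparison; the residual risk is a hash collision, which is what the vendors argue modern algorithms make vanishingly unlikely.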
Other companies take a different approach. ExaGrid, for example, cuts data into large segments, analyzes the contents of each segment to see how they relate to one another, and then performs byte-level differencing on each segment and stores the data that way.
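For a rough sense of how byte-level differencing works, here's a toy sketch in Python (illustrative only, not ExaGrid's actual algorithm): given a new segment and a similar segment already stored, it records only the bytes that differ, plus copy instructions for reassembling the rest from the stored copy.

```python
import difflib

def diff_segment(base, new):
    """Encode `new` as copy/insert operations against `base`."""
    ops = []
    sm = difflib.SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse bytes from base
        else:
            ops.append(("insert", new[j1:j2]))  # store only new bytes
    return ops

def apply_diff(base, ops):
    """Rebuild the original segment from the base and the diff."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return bytes(out)

base = b"The quick brown fox jumps over the lazy dog"
new = b"The quick red fox jumps over the lazy cat"
ops = diff_segment(base, new)
restored = apply_diff(base, ops)  # round-trips back to `new`
```

The payoff is that when two segments are mostly alike, the diff is far smaller than the segment itself, so only the changed bytes consume new storage.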
Check out the entire Data Deduplication FAQ.