In this Storage Decisions presentation, Howard Marks, chief scientist at Networks Are Our Lives Inc., outlines how deduplication is performed in Microsoft Hyper-V 3.0.
"So [the] basic function of most data deduplication systems is they divide the data in your system into chunks, use [a] hash to locate other chunks that are similar, [and] do a byte-by-byte compare to see if that data is actually the same. If it turns out those two chunks contain the same data, instead of storing both, they store one, and use a pointer to replace it."
Marks noted in his presentation that Windows Server 8 and Hyper-V 3.0 feature native deduplication of data stored in Microsoft's New Technology File System (NTFS). He said that duplicate chunks of data -- those between 32 KB and 128 KB in size -- are kept in the System Volume Information store to cut down on seeking, a technique similar to those used for RAM and network I/O.
He said two basic types of data deduplication exist: inline, which determines whether a piece of data is a duplicate in real time as it is written, and post-process, which reviews stored data at set intervals. He said that Microsoft chose post-process deduplication.
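The difference between the two approaches comes down to where the duplicate check runs. The sketch below illustrates the post-process style Marks says Microsoft chose: writes land untouched, and a separately scheduled pass collapses duplicates later. The class and its methods are hypothetical, for illustration only.

```python
class PostProcessStore:
    """Post-process deduplication sketch: write() stores data as-is
    (inline dedup would check for duplicates here, on the write path);
    optimize() is the pass run at set intervals to collapse duplicates."""

    def __init__(self):
        self.blocks = {}  # block_id -> raw bytes, or ("ptr", other_id)

    def write(self, block_id, data):
        # No dedup work on the write path -- that is the point of
        # post-process: writes stay fast, cleanup happens later.
        self.blocks[block_id] = data

    def optimize(self):
        # The scheduled pass: scan stored blocks, keep the first copy
        # of each payload, and turn later copies into pointers.
        seen = {}  # payload -> block_id of the kept copy
        for block_id, data in list(self.blocks.items()):
            if isinstance(data, tuple):
                continue  # already a pointer
            if data in seen:
                self.blocks[block_id] = ("ptr", seen[data])
            else:
                seen[data] = block_id

    def read(self, block_id):
        data = self.blocks[block_id]
        if isinstance(data, tuple):       # follow the pointer
            return self.blocks[data[1]]
        return data
```

Between writes and the next `optimize()` run, duplicate data temporarily occupies full space — the trade-off post-process dedup accepts in exchange for keeping the write path cheap.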
"If you've got 150 Web servers all running Apache -- for the kids who are going to take the Web server development class -- that's gonna deduplicate down into the space used by 1.3 of those servers, and will save you a lot of space…, said Marks."[And] if the data stays deduplicated all the way up the cache in your disk array, it's going to improve your performance as well."