This is the first part of a nine-part series on deduplication. For easy access to all nine parts, check out our quick overview of Deduplication 2013.
Deduplication has evolved considerably over the years. A couple of decades ago, deduplication was used primarily as a tool for compressing archives. ZIP files and CAB files are both examples of compressed archives whose compression algorithms replace repeated sequences of data with short references, which is an early form of deduplication.
Over time, deduplication technology evolved, and a new form of deduplication called single-instance storage was developed. Single-instance storage is based on the idea that a piece of data only needs to be stored once. Some solutions perform file-level, single-instance storage as a way of eliminating the storage of duplicate files.
Single-instance storage has also been used in messaging, in products such as Exchange Server 2007. Suppose, for instance, that a user sends an e-mail message with a large attachment to 100 different recipients within the company. If 100 separate instances of the message were stored in the mailbox database, a lot of space would be consumed. Single-instance storage allows a single copy of the message to be stored in the mailbox database. Each recipient's mailbox is assigned a pointer to the message. Incidentally, Microsoft discontinued the use of single-instance storage in Exchange Server 2010 and Exchange Server 2013.
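The pointer arrangement described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the SingleInstanceStore class, its method names and the example.com addresses are hypothetical, and the sketch is not how Exchange Server actually implemented the feature.

```python
import hashlib


class SingleInstanceStore:
    """Toy message store that keeps one copy of each distinct payload."""

    def __init__(self):
        self.blobs = {}      # content hash -> payload bytes (stored once)
        self.mailboxes = {}  # recipient -> list of content hashes (pointers)

    def deliver(self, recipients, payload):
        """Store the payload once and give every recipient a pointer to it."""
        key = hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(key, payload)  # no-op if already stored
        for recipient in recipients:
            self.mailboxes.setdefault(recipient, []).append(key)
        return key

    def read(self, recipient):
        """Follow a mailbox's pointers back to the stored payloads."""
        return [self.blobs[key] for key in self.mailboxes.get(recipient, [])]

    def stored_bytes(self):
        """Bytes actually held by the store, regardless of recipient count."""
        return sum(len(blob) for blob in self.blobs.values())


# Hypothetical scenario: one 1 MB attachment sent to 100 recipients.
store = SingleInstanceStore()
recipients = [f"user{i}@example.com" for i in range(100)]
attachment = b"x" * 1_000_000
store.deliver(recipients, attachment)
```

In this sketch the store holds 1,000,000 bytes rather than 100,000,000, because each of the 100 mailboxes holds only a hash that points to the shared copy.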
Today, deduplication is typically performed at the block level. This allows data to be deduplicated more efficiently than would be possible through file-level deduplication or other forms of single-instance storage, because two files that differ only slightly can still share most of their blocks. Deduplication is used by backup products as a way to reduce the storage requirements for disk-based backups. It is also used for WAN optimization, which makes centralized backups and backups to the cloud more practical. Part 1 of this series looks at hardware-based deduplication.
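To illustrate why block-level deduplication can beat file-level deduplication, here is a minimal Python sketch. It assumes fixed-size 4 KB blocks and SHA-256 fingerprints; real products often use variable-size chunking and their own index structures, and the function names here are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def dedupe_blocks(data, index=None, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and store each distinct block once.

    Returns (recipe, index), where recipe is the ordered list of block
    fingerprints needed to rebuild the data and index maps each
    fingerprint to the single stored copy of that block.
    """
    if index is None:
        index = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        index.setdefault(digest, block)  # keep only the first copy
        recipe.append(digest)
    return recipe, index


def rebuild(recipe, index):
    """Reassemble the original data from its recipe."""
    return b"".join(index[digest] for digest in recipe)


# Two hypothetical files that differ only in their last block.
file_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
file_b = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"D" * BLOCK_SIZE

recipe_a, index = dedupe_blocks(file_a)
recipe_b, index = dedupe_blocks(file_b, index)
```

File-level deduplication would treat these as two different files and store all six blocks. The block-level index in this sketch ends up holding four, since the two leading blocks are shared.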
Virtual tape libraries have long been the go-to appliance for organizations wishing to perform deduplicated, disk-based backups. Virtual tape library appliances emulate a physical tape drive or a physical tape library, but use disk-based storage instead of tape.
There are several advantages to using virtual tape libraries. First, virtual tape libraries allow organizations to easily transition from tape-based backup to disk-based backup. Because virtual tape libraries emulate tape devices, a virtual tape library can easily be used with existing backup software that is designed specifically for tape.
Another advantage to using virtual tape libraries is that disk-based backup technologies tend to be more efficient than tape. Tape-based backups require data to be streamed to the drive at a steady rate. This streaming process can hurt backup performance, because data transfers must be throttled to meet the tape drive's streaming requirements. Disk-based backups do not depend on streaming, which eliminates any inefficiencies related to the streaming process.
Because virtual tape libraries have limited storage capacity, they almost always offer storage deduplication. Data deduplication is typically performed inline, which means that the data is deduplicated in real time, as it is written to the appliance. The actual level of storage compression that can be achieved depends on a number of variables, such as the make and model of the virtual tape library appliance and the type of data that is being stored. However, it is not uncommon to achieve compression ratios in the 20:1 to 25:1 range, and the compression ratios can sometimes be much higher.
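To put those ratios in perspective, the short Python sketch below converts a ratio into physical capacity and percentage savings. The 100 TB figure is an assumption chosen purely for illustration, and the function names are hypothetical.

```python
def physical_tb(logical_tb, ratio):
    """Physical capacity consumed when logical_tb is reduced at ratio:1."""
    return logical_tb / ratio


def space_savings(ratio):
    """Fraction of logical capacity saved at a given ratio:1 reduction."""
    return 1 - 1 / ratio


# Assumed workload: 100 TB of backup data before deduplication.
LOGICAL_TB = 100

at_20_to_1 = physical_tb(LOGICAL_TB, 20)  # 5.0 TB on disk
at_25_to_1 = physical_tb(LOGICAL_TB, 25)  # 4.0 TB on disk
```

Under these assumptions, moving from 20:1 to 25:1 saves only one additional terabyte per 100 TB, because the savings fraction rises from about 95 percent to about 96 percent. The returns diminish as the ratio climbs.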
Appliances are able to achieve such high data compression ratios because the compression process is handled at the hardware level. Since the deduplication process is performed by dedicated hardware (as opposed to exerting a load on a production server), hardware vendors are free to use extremely aggressive compression algorithms.
Although it is easy to think of a backup appliance as a self-contained device, most modern backup appliances include built-in data replication capabilities. These capabilities give organizations a great deal of flexibility with regard to offsite data protection. Data can be written to tape, replicated to offsite storage, or even replicated to the cloud. The fact that any data stored within the appliance has already been deduplicated makes it possible to efficiently send large amounts of data across a WAN link in a reasonable amount of time. For example, EMC Data Domain offers throughput of up to 31 TB per hour, and when it replicates the contents of one appliance to another, it uses deduplication to ensure that only compressed data is sent across the network.
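The general idea behind deduplicated replication can be sketched as follows. This is a simplified illustration, not EMC Data Domain's or any other vendor's actual protocol: the Appliance class, its methods and the sample payloads are all hypothetical. The source offers block fingerprints, the target reports which ones it lacks, and only those blocks cross the link.

```python
import hashlib


def fingerprint(block):
    """Content fingerprint used to identify a block on both sides."""
    return hashlib.sha256(block).hexdigest()


class Appliance:
    """Toy backup appliance holding deduplicated blocks by fingerprint."""

    def __init__(self):
        self.blocks = {}

    def store(self, block):
        self.blocks.setdefault(fingerprint(block), block)

    def missing(self, fingerprints):
        """Return the fingerprints this appliance does not yet hold."""
        return [fp for fp in fingerprints if fp not in self.blocks]


def replicate(source, target):
    """Copy only the blocks the target lacks; return the bytes sent."""
    wanted = target.missing(list(source.blocks))
    sent = 0
    for fp in wanted:
        block = source.blocks[fp]
        target.blocks[fp] = block
        sent += len(block)
    return sent


primary, offsite = Appliance(), Appliance()
for payload in (b"monday" * 1000, b"tuesday" * 1000):
    primary.store(payload)

first_pass = replicate(primary, offsite)   # both blocks are new to the target
primary.store(b"wednesday" * 1000)
second_pass = replicate(primary, offsite)  # only the new block is sent
```

In this sketch the first pass sends 13,000 bytes and the second pass sends only the 9,000 bytes of the newly added block, which is why replication between deduplicating appliances places a comparatively light load on a WAN link.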
Today, a number of vendors offer backup appliances with built-in deduplication capabilities, and each vendor tends to incorporate its own unique set of features. A particularly good example of this is Riverbed's Whitewater Cloud Storage Gateway. Like similar appliances from other vendors, the Riverbed appliance stores backups locally, but also replicates stored data to cloud storage in near-real time. However, Riverbed also offers tiered storage capabilities that are not found in all backup appliances. Organizations with archive data that must be retained can use the Riverbed appliance to move the archive data off expensive on-premises storage and onto lower-cost cloud storage.
Part 2 of this series explores software-based deduplication.
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server (IIS). Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the department of information management at Fort Knox. You can visit Brien's personal website at www.brienposey.com.
This was first published in April 2013