Archiving is on everyone's mind right now in the storage world. Compliance demands are pushing users to implement some kind of archiving product, and there's been a lot of interest in finding new ways to deal with the increased amounts of data that companies need to save. Data reduction looks like a promising approach, and content-addressed storage (CAS) and data deduplication have emerged to cope with sprawling data growth. Although they are sometimes confused, CAS and data deduplication are not at all the same thing. While they are both used in archiving data, CAS may or may not include data deduplication, as it is commonly understood, to reduce the amount of data stored.
Data deduplication examines the data to be saved at the block level looking for duplicate blocks. When it finds a duplicate it replaces it with a reference that points to the original copy of the block. How much space this saves depends on the nature of the data being stored. In some cases, such as email, the savings can run to 20 to 1 or more.
One of the major sources of skepticism about data deduplication is overhead. Obviously, it takes both time and computing power to examine every block of data to be stored and compare it with every block of data currently in storage. Makers of products incorporating data deduplication have spent a lot of time and effort speeding up the process. At the most basic level, most of them use hashing to identify each unique block, and many of them use much more sophisticated schemes. As a result, the throughput of backup and archiving systems using data deduplication has been climbing. Diligent Technologies Corp. recently claimed one of its customers achieved 400 MBps throughput using the latest version of the company's ProtecTier disk-based backup product.
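The hashing approach described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the fixed 4 KB block size, the choice of SHA-256, the in-memory dictionary standing in for the block store, and the names `dedupe` and `rebuild` are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


def dedupe(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks and keep each unique block once.

    Returns the list of block hashes (references) needed to rebuild the data.
    """
    refs = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        # Only store the block if its hash has not been seen before;
        # a duplicate costs one reference instead of one block.
        if digest not in store:
            store[digest] = block
        refs.append(digest)
    return refs


def rebuild(refs: list, store: dict) -> bytes:
    """Follow the references to reassemble the original data."""
    return b"".join(store[digest] for digest in refs)


store = {}
data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE  # three identical blocks, one different
refs = dedupe(data, store)
print(len(refs), "blocks referenced,", len(store), "blocks stored")
```

Looking up a hash in an index avoids comparing each incoming block byte-for-byte against everything already in storage, which is why hashing is the usual starting point for reducing the overhead.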
CAS is a much broader concept than data deduplication. As the term is used today, it refers to systems that locate items by unique identifiers based on the content itself rather than its location in storage.
When an object, such as a document, is stored in a CAS system, its content is scanned and an identifier, such as a hash value, is generated. This identifier is then used to retrieve the object as needed. Since two identical objects, such as duplicate copies of the same document, generate the same identifier, only one copy is stored. This behavior, known as single instancing, is one of the sources of confusion between the terms. Single instancing is not nearly as efficient at saving storage space as block-level data deduplication, and when most people talk about data deduplication they mean block-level data deduplication.
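To make the store-and-retrieve cycle concrete, here is a minimal sketch of a content-addressed store. It assumes SHA-256 as the identifier and an in-memory dictionary as the backing store; the class name `ContentStore` and its `put` and `get` methods are hypothetical and do not correspond to any product's API.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the address is a hash of the content."""

    def __init__(self):
        self._objects = {}  # address -> content (stand-in for real storage)

    def put(self, content: bytes) -> str:
        address = hashlib.sha256(content).hexdigest()
        # Identical content yields the same address, so a second copy
        # is not stored again (whole-object single instancing).
        self._objects.setdefault(address, content)
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def __len__(self) -> int:
        return len(self._objects)


cas = ContentStore()
first = cas.put(b"Quarterly report, final version")
second = cas.put(b"Quarterly report, final version")  # duplicate document
print(first == second, len(cas))
```

Note that the duplicate is detected only because the two objects are identical in their entirety. Two documents that differ by a single byte would get different addresses and both be stored in full, which is where block-level data deduplication saves more space.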
One of the major attractions of CAS is that because each object's identifier is based on its content, it is easy to verify that the retrieved object hasn't been changed since it was stored. This makes CAS very attractive for compliance-related storage.
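The verification step can be sketched as recomputing the identifier from the retrieved content and comparing it with the address used to fetch it. As before, SHA-256 and the helper names `address_of` and `is_unchanged` are assumptions for illustration only.

```python
import hashlib


def address_of(content: bytes) -> str:
    """Derive the identifier from the content itself (SHA-256 assumed)."""
    return hashlib.sha256(content).hexdigest()


def is_unchanged(address: str, retrieved: bytes) -> bool:
    """True if the retrieved bytes still hash to the address they were stored under."""
    return address_of(retrieved) == address


original = b"Signed contract, 2007-05-01"
address = address_of(original)

tampered = b"Signed contract, 2007-06-01"
print(is_unchanged(address, original))
print(is_unchanged(address, tampered))
```

Because the check needs only the address and the retrieved bytes, it can be run by whoever reads the object, without trusting a separate record of what was originally stored.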
Of course, that also means that any change to an object stored in a CAS system creates a new object that is stored separately. This is one of the reasons CAS is best suited to data that will not change once it is saved. The other reason is overhead. Storing an object in a CAS system requires significantly more time and computing power than storing it in a conventional file system. Retrieval is much less affected.
Even more than data deduplication, CAS is currently a hot backup and archiving technology. CAS systems are available from more than a dozen vendors ranging from very large, such as EMC (Centera) and Hewlett-Packard Co. (HP) (StorageWorks), to small, such as PermaBit Inc. with its Dynamic Information Services appliance.
Again, even more than data deduplication, CAS systems vary enormously in approach, architecture, capacity, throughput and price. Storage administrators who are considering CAS need to perform a thorough review of their needs and carefully research the available products to find the best match for their enterprise.
About the author: Rick Cook specializes in writing about issues related to storage and storage management.
This was first published in May 2007