Tip

CAS and data deduplication: Partners in archiving

What you will learn: This tip explores CAS and data deduplication, discusses the uses for each and outlines the technologies' strengths and weaknesses.

Archiving is on everyone's mind right now in the storage world. Compliance demands are pushing users to implement some kind of archiving product, and there's been a lot of interest in finding new ways to deal with the increased amounts of data that companies need to save. Data reduction looks like a promising approach, and content-addressed storage (CAS) and data deduplication have emerged to cope with sprawling data growth. Although they are sometimes confused, CAS and data deduplication are not at all the same thing. While they are both used in archiving data, CAS may or may not include data deduplication, as it is commonly understood, to reduce the amount of data stored.

While CAS is a distinct class of products, deduplication isn't a product at all. It's a feature found in many kinds of products other than CAS. Many document management applications use it, especially those for email, such as Mimosa Systems Inc.'s NearPoint archival software for Microsoft Exchange. So do other non-CAS software and hardware products, including some virtual tape libraries (VTLs), such as FalconStor's, and remote backup software from companies such as Asigra Inc.

Data deduplication examines the data to be saved at the block level, looking for duplicate blocks. When it finds a duplicate, it replaces the duplicate with a reference that points to the original copy of the block. How much space this saves depends on the nature of the data being stored. In some cases, such as email, the savings can run to 20 to 1 or more.
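
The replace-with-a-reference idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: it assumes fixed-size 4 KB blocks, keeps everything in memory, and the names `deduplicate` and `restore` are hypothetical. The sample data is contrived so that every block is identical; real data will deduplicate less neatly.

```python
def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each distinct block."""
    unique_blocks = []   # the single stored copy of each distinct block
    position_of = {}     # block content -> index into unique_blocks
    references = []      # one entry per original block, pointing at a stored copy
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        if block not in position_of:
            # First time this block is seen: store it
            position_of[block] = len(unique_blocks)
            unique_blocks.append(block)
        # Duplicate or not, record only a reference to the stored copy
        references.append(position_of[block])
    return unique_blocks, references


def restore(unique_blocks, references) -> bytes:
    """Rebuild the original data by following the references in order."""
    return b"".join(unique_blocks[i] for i in references)


# Contrived data (an assumption): the same 4 KB attachment block in 20 mailboxes
attachment = b"A" * 4096
mail_store = attachment * 20

blocks, refs = deduplicate(mail_store)
ratio = len(refs) / len(blocks)  # blocks written by the client vs. blocks actually kept
```

The ratio here counts blocks only and ignores the space taken by the references themselves, which a real system would also have to store.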

One of the major sources of skepticism about data deduplication is overhead. Obviously, it takes both time and computing power to examine every block of data to be stored and compare it with every block of data currently in storage. Makers of products incorporating data deduplication have spent a lot of time and effort speeding up the process. At the most basic level, most of them use hashing to identify each unique block, and many of them use much more sophisticated schemes. As a result, the throughput of backup and archiving systems using data deduplication has been climbing. Diligent Technologies Corp. recently claimed one of its customers achieved 400 MBps throughput using the latest version of the company's ProtecTier disk-based backup product.
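
To see why hashing helps with the overhead, the sketch below counts the work done by two approaches on the same input: comparing each incoming block against every stored block, versus computing one hash per block and looking it up in an index. It is an illustration only. SHA-256 is an assumed choice of hash, the function names are hypothetical, and vendors' actual schemes are more sophisticated than this.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Identify a block by a hash of its content (SHA-256 is an assumption here)."""
    return hashlib.sha256(block).hexdigest()


def count_naive_comparisons(blocks):
    """Compare each incoming block against the stored blocks, one by one."""
    stored = []
    comparisons = 0
    for block in blocks:
        found = False
        for existing in stored:
            comparisons += 1
            if existing == block:
                found = True
                break
        if not found:
            stored.append(block)
    return comparisons, len(stored)


def count_hashed_lookups(blocks):
    """Compute one fingerprint per incoming block and look it up in an index."""
    index = {}
    lookups = 0
    for block in blocks:
        digest = fingerprint(block)
        lookups += 1
        if digest not in index:
            index[digest] = block
    return lookups, len(index)


# Contrived input (an assumption): 1,000 blocks, only 50 of them distinct
incoming = [bytes([i % 50]) * 512 for i in range(1000)]

naive_work, naive_stored = count_naive_comparisons(incoming)
hashed_work, hashed_stored = count_hashed_lookups(incoming)
```

Both approaches end up storing the same 50 blocks, but the hashed version does one lookup per incoming block no matter how much data is already stored, while the naive count grows with the size of the store.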

CAS is a much broader concept than data deduplication. As the term is used today, it refers to systems that locate items by unique identifiers based on the content itself rather than its location in storage.

When an object, such as a document, is stored in a CAS system, its content is scanned and an identifier, such as a hash value, is generated. This identifier is then used to retrieve the object as needed. Since two identical objects, such as duplicate copies of the same document, will generate the same identifier, only one copy will be stored. This behavior, known as single instancing, is one of the sources of confusion between the terms. Single instancing is not nearly as efficient at saving storage space as block-level data deduplication, and when most people talk about data deduplication they mean block-level data deduplication.
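
A toy content-addressed store makes the mechanism concrete. The sketch below is an in-memory illustration under stated assumptions: SHA-256 stands in for whatever identifier scheme a real CAS product uses, and the class and method names are hypothetical.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store that locates objects by a hash of their content."""

    def __init__(self):
        self._objects = {}  # identifier -> content

    def put(self, content: bytes) -> str:
        # The identifier is derived from the content, not from a path or location
        identifier = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same identifier, so only one copy is kept
        self._objects.setdefault(identifier, content)
        return identifier

    def get(self, identifier: str) -> bytes:
        return self._objects[identifier]

    def object_count(self) -> int:
        return len(self._objects)


store = ContentAddressedStore()
first_id = store.put(b"Q1 compliance report")
second_id = store.put(b"Q1 compliance report")  # a duplicate of the same document
```

Storing the duplicate returns the same identifier and leaves the store holding a single object, which is the single instancing described above. Note that it works on whole objects only; two documents that differ by one character share nothing.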

One of the major attractions of CAS is that because each object's identifier is based on its content, it is easy to verify that the retrieved object hasn't been changed since it was stored. This makes CAS very attractive for compliance-related storage.
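
The verification step amounts to recomputing the identifier from the retrieved bytes and comparing it with the identifier used to fetch them. A minimal sketch, again assuming a SHA-256 identifier and using hypothetical function names:

```python
import hashlib


def identifier_for(content: bytes) -> str:
    """Derive the identifier from the content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def is_unchanged(identifier: str, retrieved: bytes) -> bool:
    """Check that the retrieved bytes still hash to the identifier they were stored under."""
    return identifier_for(retrieved) == identifier


original = b"Signed contract, version 1"
stored_id = identifier_for(original)

intact_copy = b"Signed contract, version 1"
altered_copy = b"Signed contract, version 2"

intact_ok = is_unchanged(stored_id, intact_copy)     # True: content matches its identifier
altered_ok = is_unchanged(stored_id, altered_copy)   # False: content no longer matches
```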

Of course, that also means that any change to an object stored in a CAS system creates a new object that is stored separately. This is one of the reasons CAS is best suited to data that will not change once it is saved. The other reason is overhead. Storing an object in a CAS system requires significantly more time and computing power than storing it in a conventional file system. Retrieval is much less affected.
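
The effect of a change can be shown with a small sketch. It assumes whole-object SHA-256 identifiers and a plain dictionary as the store; the `put` helper and the sample documents are hypothetical.

```python
import hashlib

store = {}  # identifier -> content


def put(content: bytes) -> str:
    """Store content under an identifier derived from the content itself."""
    identifier = hashlib.sha256(content).hexdigest()
    store.setdefault(identifier, content)
    return identifier


original = b"Retention policy: keep records for seven years."
revised = original + b" Amended."  # a small edit to the same document

original_id = put(original)
revised_id = put(revised)

# Both versions are now held in full, each under its own identifier
bytes_stored = sum(len(content) for content in store.values())
```

The small amendment produces a different identifier, so the store holds two complete objects. The original remains retrievable under its old identifier, which suits data that is not supposed to change, but frequently edited data would accumulate a full copy per revision.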

Even more than data deduplication, CAS is currently a hot backup and archiving technology. CAS systems are available from more than a dozen vendors ranging from very large, such as EMC (Centera) and Hewlett-Packard Co. (HP) (StorageWorks), to small, such as PermaBit Inc. with its Dynamic Information Services appliance.

Again, even more than data deduplication, CAS systems vary enormously in approach, architecture, capacity, throughput and price. Storage administrators who are considering CAS need to perform a thorough review of their needs and carefully research the available products to find the best match for their enterprise.

About the author: Rick Cook specializes in writing about issues related to storage and storage management.

This was first published in May 2007
