Data protection techniques for object storage systems

Techniques such as replication and erasure coding protect data on object storage systems and other high-capacity primary storage systems when traditional backup is difficult.

Object storage systems are designed to cost-effectively store large amounts of data for very long periods of time. However,...

that makes traditional backup difficult, if not impossible. To ensure data is protected from both disk failure and corruption, vendors use replication or erasure coding (or a combination of the two).

Even if you are not considering object storage, understanding the differences between these data protection techniques is important since many primary storage arrays are beginning to use them. We explore the pros and cons of each approach so you can determine which method of data protection is best for your data center.

Scale-out basics

Most object storage systems, as well as converged systems, rely on scale-out storage architectures. These architectures are built around a cluster of servers that provide storage capacity and performance. Each time another node is added to the cluster, the performance and capacity of the overall cluster increase.

These systems require redundancy across multiple storage nodes so that if one node fails, data can still be accessed. Typical RAID levels such as RAID 5 and RAID 6 are particularly ill-suited for this multi-node data distribution because of their slow rebuild times.

Replication pros and cons

Replication was the most prevalent form of data protection in early object storage systems and is becoming a common data protection technique in converged infrastructures, which are also node-based.

In this protection scheme, each unique object is copied a given number of times to a specified number of nodes, where the number of copies and how they're distributed (how many nodes receive a copy) is set manually or by policy. Many of these products also have the ability to control the location of the nodes that will receive the copies. They can be in different racks, different rows and, of course, different data centers.
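As a rough illustration of policy-driven placement, the sketch below spreads copies across racks before reusing any rack. The function name, node names and rack labels are hypothetical, and real products apply their own placement rules; this only shows the general idea of putting each copy in a different failure domain.

```python
from collections import defaultdict


def place_replicas(nodes, copies):
    """Pick `copies` nodes, spreading across racks before reusing one.

    `nodes` maps a node name to a rack name (both hypothetical labels).
    """
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested copy count")

    # Group nodes by rack so that each rack acts as one failure domain.
    by_rack = defaultdict(list)
    for node, rack in sorted(nodes.items()):
        by_rack[rack].append(node)

    chosen = []
    racks = sorted(by_rack)
    # Round-robin over racks: a rack is reused only after every other
    # rack with free nodes has received a copy.
    while len(chosen) < copies:
        for rack in racks:
            if by_rack[rack] and len(chosen) < copies:
                chosen.append(by_rack[rack].pop(0))
    return chosen


nodes = {"n1": "rack-a", "n2": "rack-a", "n3": "rack-b", "n4": "rack-c"}
placement = place_replicas(nodes, copies=3)
print(placement)
```

The same approach could be extended to rows or data centers by grouping on a different label.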

The advantage of replication is that it is a relatively lightweight process, in that no complex calculations have to be made (compared with erasure coding). Also, it creates fully usable, standalone copies that are not dependent on any other data set for access. In converged or hyperconverged architectures, replication also allows for better virtual machine performance since all data can be served up locally.

The obvious downside to replication is that full, complete copies are made, and each redundant copy consumes that much more storage capacity. For smaller environments, this can be a minor detail. For environments with multiple petabytes of information, it can be a real problem. For example, a 5 PB environment could require 15 PB of total capacity, assuming a relatively common three-copy strategy.
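The arithmetic behind that example is simple enough to sketch. The function name below is an assumption made for illustration; the figures are the ones used above.

```python
def raw_capacity_pb(usable_pb, copies):
    """Total capacity needed when every object is stored `copies` times."""
    return usable_pb * copies


usable = 5  # PB of unique data
total = raw_capacity_pb(usable, copies=3)
extra = total - usable  # capacity consumed purely by redundant copies

print(f"{usable} PB of data needs {total} PB raw ({extra} PB is redundancy)")
```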

What is erasure coding?

Erasure coding is a parity-based data protection scheme similar to RAID 5 and 6, but it operates at a finer level of granularity. In RAID 5 and 6, the unit of protection is the volume, whereas with erasure coding it is the object. This means that if a drive or node fails, only the objects on that drive or node need to be re-created, not the entire volume.
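To make the parity idea concrete, here is a minimal sketch that splits one object into data fragments plus a single XOR parity fragment and then rebuilds a lost fragment from the survivors. This is a deliberate simplification: it tolerates only one lost fragment, whereas production systems typically use codes that survive several failures. The function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(obj, k):
    """Split `obj` into k equal data fragments plus one XOR parity fragment."""
    frag_len = -(-len(obj) // k)  # ceiling division
    padded = obj.ljust(frag_len * k, b"\0")
    data = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = reduce(xor_bytes, data)
    return data + [parity]


def recover(fragments, lost_index):
    """Rebuild the fragment at `lost_index` by XORing all the others."""
    survivors = [f for i, f in enumerate(fragments) if i != lost_index]
    return reduce(xor_bytes, survivors)


obj = b"object storage payload"
frags = encode(obj, k=4)          # 4 data fragments + 1 parity fragment
rebuilt = recover(frags, lost_index=2)
print(rebuilt == frags[2])
```

Only the fragments of the affected object are read and recomputed, which is the object-level granularity described above.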

Similar to replication, erasure coding can be set either manually or by policy to survive a certain number of node failures before there is data loss. Many systems extend erasure coding between data centers, so that the data can be automatically distributed between data centers and nodes within those data centers.

Since it is parity-based, erasure coding does not create multiple, redundant copies of data the way replication does. This means the capacity "overhead" for erasure coding is measured in fractions of the primary data set instead of multiples. An erasure-coded scheme designed to survive the same number of failures as three-way replication requires roughly 25% additional capacity, whereas three-way replication stores 300% of the primary data set (that is, 200% beyond the original copy).
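The comparison can be worked through with a short calculation. The 8+2 layout below (eight data fragments, two parity fragments) is an assumption chosen because it tolerates two losses, the same as three copies; the article does not name a specific layout. Note that three copies amount to 300% of the primary data in total, which is 200% beyond the original.

```python
def replication_overhead(copies):
    """Extra capacity beyond the primary copy, as a fraction of the data."""
    return float(copies - 1)


def erasure_overhead(data_fragments, parity_fragments):
    """Extra capacity for parity, as a fraction of the data."""
    return parity_fragments / data_fragments


# Both schemes below survive the loss of any two nodes holding the object.
rep = replication_overhead(copies=3)
ec = erasure_overhead(data_fragments=8, parity_fragments=2)

print(f"3x replication: {rep:.0%} extra capacity")
print(f"8+2 erasure coding: {ec:.0%} extra capacity")
```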

The downside to erasure coding is that it's not as lightweight as replication. It typically requires more CPU and RAM resources to manage and to calculate the parity. More importantly, every access requires data to be reconstituted, since erasure-coded data is broken into blocks that are stored across nodes. This reassembly can generate considerable storage network traffic compared with replication, which can be designed to require little or no storage network traffic. The additional network traffic could be particularly troublesome in a WAN or cloud implementation, since the WAN adds latency to every access.

Blended model

In an attempt to deliver the best of both worlds, some vendors are creating blended models. In the first, replication is used within the data center, so that most accesses from storage have the benefit of LAN-like performance, and erasure coding is used to distribute data to the organization's other data centers. While capacity consumption is still high, data integrity is equally high.

The other blended model is based solely on erasure coding, but the erasure coding is zoned by data center. In this model, erasure coding is used locally and across the WAN, but one copy of all the data remains in the data center that needs it most. Then data is erasure-coded remotely across the other data centers in the customer's ecosystem. While this method consumes more capacity than regular erasure coding, it is still more efficient than the other blended model.

What is the best data protection method?

As always, the answer is, "It depends." There is a lot to like about the simplicity of replication. It works well for data centers with less than 25 TB. But as data grows, continuing with a replication strategy for data protection becomes untenable.

For larger data centers, replication becomes too expensive from a capacity consumption perspective. If they have high bandwidth or short interconnection distances, erasure coding provides excellent storage efficiency and ideal data distribution. Data centers that have latency issues should consider one of the blended models, most likely the second, which provides almost as much storage efficiency and eliminates network latency issues for day-to-day data access.
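The guidance above can be summarized as a small decision helper. This is only a sketch of the article's rule of thumb: the function and parameter names are hypothetical, and the 25 TB threshold is the figure quoted above, not a hard limit.

```python
def suggest_protection(capacity_tb, has_latency_issues):
    """Map the article's rule of thumb onto a suggested protection method."""
    if capacity_tb < 25:
        # Small environments: the capacity cost of full copies is tolerable.
        return "replication"
    if has_latency_issues:
        # Keep one full copy local, erasure-code across the other sites.
        return "blended model (zoned erasure coding)"
    # High bandwidth or short distances between sites.
    return "erasure coding"


print(suggest_protection(10, has_latency_issues=False))
print(suggest_protection(5000, has_latency_issues=True))
print(suggest_protection(5000, has_latency_issues=False))
```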

About the author:
George Crump is president of Storage Switzerland, an IT analyst firm focused on storage and virtualization.
