Source-side data reduction and backup

For short-term, rapid recoveries of the most recent versions of data, focus should be on source-side data reduction.

Data reduction strategies abound today. There's data deduplication technology, which identifies redundant streams of data and eliminates the duplicates. There's single instancing, which identifies identical files and stores only one copy. Finally, there's standard compression. With all of these technologies, the data reduction happens on the target, where the backup data set is stored (tape or disk). While that has value for long-term data storage, it does nothing to address other backup process problems, such as shrinking the backup window, reducing the impact on the client and, most importantly, reducing network bandwidth requirements.

The backup data set also has to be reduced at its source -- at the application server, where the backup process starts -- before it's sent across the network to the backup server. Doing so can truly shrink the backup window, cut backup network consumption and even improve backup usefulness.

While attempts to reduce data at its source aren't new, they haven't traditionally been well-executed. The oldest technique, dating back to the early days of mainframe use, is synthetic full backups. The concept is to do a full backup once, back up only incrementally changed data from that point forward and then, at some point, consolidate the full and its incrementals into a new full. While most applications list this as a capability, many have shortcomings in delivery; for example, they don't deliver a true standalone full backup after consolidation.
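The synthetic full concept can be illustrated with a toy model. The sketch below is hypothetical and not drawn from any specific backup product: it tracks a set of files as name-to-content pairs, captures only what changed since the previous state and then merges the old full with its incrementals into a new standalone full. Note that this simple merge never records deletions, which hints at why some implementations fall short of a true consolidated full.

```python
def take_incremental(current: dict, previous: dict) -> dict:
    """Capture only files that were added or changed since the previous state."""
    return {name: data for name, data in current.items()
            if previous.get(name) != data}

def consolidate(full: dict, incrementals: list) -> dict:
    """Merge the old full with each incremental into a new synthetic full.

    Deleted files are not handled here -- one reason real products can
    fail to deliver a true standalone full after consolidation.
    """
    merged = dict(full)
    for inc in incrementals:
        merged.update(inc)
    return merged
```

A nightly schedule would call `take_incremental` against the last backed-up state, then periodically run `consolidate` instead of re-reading every file for a fresh full.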

There have even been attempts to implement data deduplication on the application server prior to transmitting the data across the network. The challenge with this approach is that it requires a lot of processing power on those source servers. As a result, source-side data deduplication has been used mostly as a remote-office backup solution.

The latest attempt to reduce backup data at its source by doing block-level incrementals has come from companies like NetApp Inc., with its Open Systems SnapVault (OSSV), and the Express Recovery Server (XRS) feature of SyncSort Inc.'s Backup Express.

Block-level incrementals

Block-level incremental (BLI) backups deliver the benefits of low network bandwidth utilization, low client impact and efficient storage of the saved backup set. Because BLI backups are a volume-to-volume comparison, there's an initial requirement of a 1:1 disk ratio for backup disk. By using volume-to-volume comparisons, BLIs place very low resource requirements on the backup client. Unlike source-side data deduplication, where each server's data must be compared against data from all servers in the environment, only one comparison needs to be performed. This results in very low resource utilization on the client side. In fact, many backups can be done in minutes, including the transfer of the changed blocks; of course, time will vary depending on how much block data has changed since the last backup. Because this transfer consists of just the changed blocks, the utilization of the network is negligible.

The BLI process first requires creation of a 1:1 image of the servers being backed up on a target disk. Once the original data set is complete, a full backup is no longer required. In subsequent backups, BLIs send the changed blocks to the backup server and update the primary backup target. Unlike synthetic fulls, no consolidation is required because the update of the "full" backup set happens in real time. However, backups are often used for more than just the most recent copy of data. For example, you may need to recover a version of a database from a few days before a corruption. To deliver point-in-time functionality, a BLI will, prior to backup, take a snapshot of the target volume, allowing access to prior versions of the backup.
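The mechanics described above can be sketched in a few lines. This is a simplified, hypothetical illustration, not code from any of the products mentioned: the volume is treated as a byte string divided into fixed-size blocks, each block is fingerprinted with a hash, and only blocks whose fingerprints differ from the last backup are shipped and applied to the standing "full" image. The 4 KB block size is an assumption for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, not a product default

def block_hashes(data: bytes) -> list:
    """Fingerprint every fixed-size block of a volume image."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def incremental(data: bytes, previous_hashes: list) -> dict:
    """Return only the blocks that changed since the last backup."""
    changes = {}
    for i, h in enumerate(block_hashes(data)):
        if i >= len(previous_hashes) or h != previous_hashes[i]:
            changes[i] = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
    return changes

def apply_incremental(target: bytearray, changes: dict) -> None:
    """Update the standing 'full' image in place -- no consolidation step."""
    for i, block in changes.items():
        end = i * BLOCK_SIZE + len(block)
        if end > len(target):
            target.extend(b"\x00" * (end - len(target)))
        target[i * BLOCK_SIZE:end] = block
```

A point-in-time version, as the article describes, would simply snapshot (copy) the target image before `apply_incremental` runs, so older states remain mountable.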

With BLI backup technology, the following advantages can be achieved:

Faster backups. A growing challenge within the backup process today is dealing with file servers that have millions of small files. Traditional backup applications "walk" the file system -- meaning they essentially examine each file to determine whether it has changed and should be included in the next backup cycle. The result is a very long backup window on those servers because of the time needed to complete the file system walk. BLI backup, by comparison, looks only for changed blocks, a much faster and less resource-intensive scan. In a typical environment, a multimillion-file server can back up in approximately five minutes after the initial backup is complete. Even the first backup will complete significantly faster because of the block methodology.

Server re-imaging. Protection of the complete file server, including the special operating system files, has often required either a separate backup or, in some cases, a separate application to perform a complete server state backup for a bare-metal recovery. With BLI backup working at the sub-file level, it's immune to file-open issues and it also captures all changes to a server, application data or OS data, no matter where that change may have occurred. Because only the changed blocks are sent across the network, it's possible to perform a full-system backup each night. Recovery can then be from a boot CD that identifies the original data set and begins the recovery process.

More granular backups are possible. By reducing data at its source, backup can be done more granularly. Traditional backup processes back up incrementally once per night and do full backups over the weekend. With BLI, because of the significantly reduced CPU usage, the short amount of time that the server is actually in a backup state (fewer than five minutes) and small amount of data that travels across the network, backups can happen frequently, even every hour. This approach delivers the equivalent of a full backup every hour, yet it only takes minutes. More frequent protection means less data loss and less re-keying of information in the event of a failure or outage.

Active targets. The resulting target of a BLI backup is an active target. The active target is a real, live and browseable file system. Through a capability called thin cloning, previous backups can be mounted back to a development server, or reporting server, for access to the near-active data set without touching the production data set.

Another capability of an active target is in-place recovery. With servers growing to hundreds of gigabytes and even terabytes, recovering a data set of this size across the network is time consuming, and the chances for error are high. With BLI backup and active targets, the data is immediately accessible via an iSCSI mount and, as a result, the application can come back online quickly with little downtime. With the application back online, the failed volume is recoverable in the background with no pressure from users.

Backup application integration. Solutions such as continuous data protection (CDP), replication and even SAN-based snapshots can provide specific components of BLI backups. BLI backups can replace all of these tools while providing tight integration to the backup target, be it tape or disk. While a BLI doesn't provide the full functionality of some of these solutions -- for example, CDP can capture every write as it occurs -- it does provide much of that functionality. At some point, the need to move data off of disk and on to a permanent store is desirable. Solutions like CDP, replication and SAN snapshots require a separate process with separate management and monitoring. Integrating BLI capabilities into the backup process eliminates the need to manage and monitor a separate step as well as the cost of acquiring those solutions.

Data reduction at the backup target has value when it comes to long-term data storage. For short-term, rapid recoveries of the most recent versions of data, focus should be placed on source-side data reduction. Doing so dramatically reduces backup windows, increasing backup frequency and lowering data loss.

About the author: George Crump, founder of Storage Switzerland, is an independent storage analyst with over 25 years of experience in the storage industry.

This was first published in June 2008
