What's the difference between data backup and data archiving?
There are several key differences between data backup and data archiving. To begin with, data backups usually copy data to a sequential-access recovery medium, whereas data archiving moves data to a lower-tier random-access medium, possibly leaving a stub behind. In addition, backups typically retain multiple copies of the data being protected, while archives retain a single copy of the data being stored, either through data deduplication or content-addressable storage (CAS) technologies. Another major difference is that data archiving provides indexing and search capabilities within files, whereas data backups offer at most a keyword search based on policy or backup image. And finally, data backups are generally designed for short-term storage for recovery purposes, as opposed to data archives, which are designed for long-term storage for regulatory and legal compliance.
Most data archiving platforms support data deduplication, whether the dedupe is through single-instance storage of email attachments, file-level deduplication, block-level deduplication or content-addressable storage algorithms. Until recently, the only way to achieve this with data backup was to use a third-party in-stream appliance to dedupe the data on its way to the final storage media. Now, CommVault Systems Inc., EMC Corp., IBM Corp. and Symantec Corp. all offer software-based content-aware data deduplication products for data backup streams. This feature clearly removes the argument that data backup does not have single-instance storage.
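To make the single-instance idea concrete, here is a minimal sketch of content-addressable storage in Python. It is an illustration only, not how any of the vendors above implement deduplication: the `ContentAddressableStore` class, its method names and the sample data are all hypothetical, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store keyed by the SHA-256 digest of each block (hypothetical)."""

    def __init__(self):
        self.blocks = {}      # digest -> bytes, each distinct content stored once
        self.bytes_seen = 0   # total bytes submitted by callers

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.bytes_seen += len(data)
        # Identical content produces the same digest, so it is kept only once.
        self.blocks.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self.blocks[digest]

    def bytes_stored(self) -> int:
        return sum(len(block) for block in self.blocks.values())


store = ContentAddressableStore()
attachment = b"quarterly-report.pdf contents" * 100
ref_a = store.put(attachment)           # first mailbox
ref_b = store.put(attachment)           # second mailbox, same attachment
ref_c = store.put(b"a different file")  # unrelated content
```

Both mailboxes end up holding the same reference, and the store keeps one copy of the attachment no matter how many times it is submitted.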
Another advance on the backup front is the implementation of data lifecycle management. As opposed to simple backup image expiration based on dates and a two-tiered storage approach based on utilization, most data backup software now supports a multi-tiered storage approach that can be determined by service-level agreements (SLAs), recovery time objectives (RTOs) or recovery point objectives (RPOs) based on business requirements. The basic concept is that critical applications have their backups stored on disk and less critical applications have their data stored on some type of sequential media. The backup software now incorporates migration between tiers based on the age of the data, so that there is a continuous waterfall of data from the most expensive and fastest recovery tier to the least expensive and slowest. However, the entire backup image is migrated or expired through the lifecycle, so the granularity is not comparable to archives (which manage lifecycles at the file level).
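The age-based waterfall can be sketched in a few lines of Python. The tier names, age thresholds and image names below are assumptions chosen for illustration; real products derive them from policy, SLAs, RTOs and RPOs. Note that the unit being placed is the whole backup image, not an individual file.

```python
from dataclasses import dataclass

# Hypothetical tiers, ordered from fastest/most expensive to slowest/cheapest.
# Each entry is (tier name, maximum image age in days held on that tier).
TIERS = [("fast_disk", 7), ("capacity_disk", 30), ("tape", 365)]


@dataclass
class BackupImage:
    name: str
    age_days: int


def tier_for(image: BackupImage):
    """Return the tier an image belongs on, or None once it has expired."""
    for tier, max_age in TIERS:
        if image.age_days <= max_age:
            return tier
    return None  # older than the last tier's limit: the whole image expires


images = [
    BackupImage("db-full-01", 2),
    BackupImage("db-full-02", 20),
    BackupImage("fs-full-01", 200),
    BackupImage("fs-full-02", 400),
]
placement = {img.name: tier_for(img) for img in images}
```

Running the placement on a schedule would move each image down the tiers as it ages, which is the waterfall described above.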
There are a few compelling data archive features that will most likely never be implemented in the data backup environment, simply because their purpose is counter to the basic design of a good backup system. One feature is the ability to search and index within files. Because the granularity of backup indices stops at the file level, the backup system will never have a need to index based on the contents of a file. However, when performing e-discovery or regulatory searches of business data, full text search capabilities can be the difference between finding or not finding requested data.
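The gap between a file-level catalog and a content index is easy to show with a small Python sketch. The documents, file names and function names here are hypothetical, and the inverted index is a deliberately simple stand-in for the full-text engines that archive products ship with.

```python
import re
from collections import defaultdict

# Hypothetical archived documents: file name -> text content.
documents = {
    "memo-0412.txt": "Contract renewal terms for the Acme account",
    "notes.txt": "Lunch order and parking reminders",
    "q3-summary.txt": "Acme revenue grew after the contract was renewed",
}


def build_index(docs):
    """Map each lowercase word to the set of files whose content contains it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search_contents(index, term):
    """Archive-style lookup: which files mention the term inside their text?"""
    return sorted(index.get(term.lower(), set()))


def search_names(docs, term):
    """What a file-level catalog can do: match on file names only."""
    return sorted(name for name in docs if term.lower() in name.lower())


index = build_index(documents)
content_hits = search_contents(index, "Acme")
name_hits = search_names(documents, "Acme")
```

In this sample, the content search finds the two documents that mention Acme, while the name-only search finds nothing, because no file name contains the term.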
The argument could be made that products such as CommVault's Simpana, IBM's Content Manager and Symantec's Enterprise Vault extend searchable features to the data backup tier. While this would be correct, these products are archive products that tightly integrate with the vendor's corresponding data backup software as a backing store; they are not features of the data backup product itself. In other words, to use the searchable feature of these products, the data must be owned by the archive tier and not the backup tier. Another feature of data archiving is random accessibility. All archives are implemented on some form of disk with some form of file system layout. This allows the archive to perform at near-disk speeds.
All data backup systems, by comparison, are implemented on some form of sequential-access archive format (usually tar or cpio), regardless of the medium on which they are stored (disk, optical or tape). This means that the backup has to read from the beginning of the archive to the point where the requested data starts, no matter how big or small that data is. The backup system is designed to provide a fast write of the data archive in its entirety during the backup and a fast read of the data in its entirety during restore; it is not designed to provide fast pinpoint access in the case of a single file edit, tagging or search. These two fundamental features of data archiving, search capabilities and random access, will most likely maintain the distance between data archives and data backups no matter how much additional functionality the vendors choose to pack into the backup software.
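The cost of sequential access can be illustrated with Python's standard `tarfile` module. This is a rough sketch under stated assumptions: the 100 small files and their names are made up, the tar stream lives in memory rather than on tape, and the offset index is a stand-in for the direct addressing a disk file system provides, not a feature of any backup product.

```python
import io
import tarfile

# Build a small tar stream in memory with hypothetical file names.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for i in range(100):
        payload = f"contents of file {i}".encode()
        info = tarfile.TarInfo(name=f"file{i:03d}.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
raw = buf.getvalue()


def sequential_read(raw_tar: bytes, wanted: str):
    """Walk member headers from the start until the wanted file is found."""
    headers_read = 0
    with tarfile.open(fileobj=io.BytesIO(raw_tar), mode="r") as tar:
        for member in tar:
            headers_read += 1
            if member.name == wanted:
                return tar.extractfile(member).read(), headers_read
    return None, headers_read


def build_offset_index(raw_tar: bytes):
    """One-time pass recording where each file's data starts and its size."""
    with tarfile.open(fileobj=io.BytesIO(raw_tar), mode="r") as tar:
        return {m.name: (m.offset_data, m.size) for m in tar}


def random_read(raw_tar: bytes, index, wanted: str):
    """Jump straight to the recorded offset instead of scanning."""
    offset, size = index[wanted]
    return raw_tar[offset:offset + size]


data_seq, headers = sequential_read(raw, "file090.txt")
data_rand = random_read(raw, build_offset_index(raw), "file090.txt")
```

Both reads return the same bytes, but the sequential path has to pass every earlier member header first, while the indexed path goes directly to the data. On a real sequential medium the scan also pays for moving past the intervening data, which is why pinpoint access to a single file is slow.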
While data backup has come a long way and has adopted several key features of a good data archive, the two are still not the same. The odds are that backup and archive will continue to evolve hand in hand but never completely merge, as their features and functionality remain complements of each other, not replacements for each other.
About this author: Ron Scruggs has more than 17 years of experience as a senior-level engineer and consultant in storage, backup and server management. He has been an integral part of enterprise-level storage, backup and recovery deployments for the past 10 years in multiple industries, including government, financial, medical and IT service providers. Ron is currently serving as a senior consultant for GlassHouse Technologies in Framingham, MA, and is providing data protection architecture services in the Boston area.
This was first published in December 2009