file virtualization technologies -- have the ability to move data in its native form, eliminating the need for proprietary software to access it. Because a disk archive is merely a network mount point, an organization can simply move qualifying data to it.
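As a rough illustration of how simple such a move can be, here is a minimal Python sketch. It assumes the archive is already mounted as an ordinary directory and that "qualifying" means "not modified for a given number of days"; the function name, parameters and the age rule are all hypothetical, not taken from any particular product.

```python
import shutil
import time
from pathlib import Path


def archive_old_files(source_dir, archive_dir, min_age_days=365, now=None):
    """Move files not modified for min_age_days from source_dir to archive_dir.

    Files keep their relative paths and their native format.
    Returns the list of destination paths.
    """
    source = Path(source_dir)
    archive = Path(archive_dir)
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # age threshold in seconds
    moved = []
    # sorted() materializes the listing so moves do not disturb the walk.
    for path in sorted(source.rglob("*")):
        if not path.is_file() or path.stat().st_mtime > cutoff:
            continue  # skip directories and recently modified files
        target = archive / path.relative_to(source)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Because the files land on the archive unchanged, they can later be read with ordinary file tools rather than through a backup application.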
Storing data in an easy-to-access form is going to be critical in the future. Backup applications don't offer this kind of access; they typically store data in a proprietary backup format, even when that data is stored on disk. A tape library -- even if it's accessed by an archive application -- requires some type of proprietary programming to move tape media from cartridge slots to tape drives.
Disk archives can achieve the cost and power efficiency of tape through the use of technologies like data deduplication and MAID (massive array of idle disks). Deduplication rates for archive data aren't as high as those for backup data because a backup repeatedly processes the same or similar data (i.e., this week's full backup is very similar to last week's full backup). Archives, on the other hand, tend to hold unique files with less similarity. These differences in data similarity, along with the earlier mentioned retention and search differences, are another reason to keep the archive data and backup data on two separate systems. Keeping them separate makes it easier to project your data deduplication rate accurately.
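The gap in deduplication rates can be seen with a toy model. The sketch below assumes simple fixed-size chunking with SHA-256 fingerprints, which is a simplification (commercial products often use variable-size chunking), and the synthetic data and helper names are invented for illustration only.

```python
import hashlib


def dedup_ratio(blobs, chunk_size=4096):
    """Return logical chunk count divided by unique chunk count."""
    seen = set()
    total = 0
    for blob in blobs:
        for offset in range(0, len(blob), chunk_size):
            chunk = blob[offset:offset + chunk_size]
            seen.add(hashlib.sha256(chunk).digest())
            total += 1
    return total / len(seen) if seen else 1.0


def make_blob(seed, chunks, chunk_size=4096):
    """Build deterministic test data: each chunk is unique to (seed, index)."""
    return b"".join(
        hashlib.sha256(f"{seed}-{i}".encode()).digest() * (chunk_size // 32)
        for i in range(chunks)
    )


# Four weekly "full backups": seven shared chunks plus one changed chunk each.
shared = make_blob("base", 7)
weekly_fulls = [shared + make_blob(f"week{k}", 1) for k in range(4)]

# Four "archive" files with no content in common.
archive_files = [make_blob(f"doc{k}", 8) for k in range(4)]

print("backup ratio:", dedup_ratio(weekly_fulls))
print("archive ratio:", dedup_ratio(archive_files))
```

In this toy data set the repeated fulls deduplicate well, while the unique archive files do not deduplicate at all. Mixing the two on one system would blur both figures.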
The differences in the archive and backup processes
Data retention: With the basics like cost efficiencies, ease of access and data integrity addressed, the final decision to separate the two systems should be based on the length of time the data needs to be retained, how much of that data there is and whether there are any specific legal requirements to retain that information.
There's typically a dramatic difference in how long data needs to be retained. For example, many organizations only have a requirement to retain backup data for one year from when the backup happened, but may have a requirement to retain all email data for seven years. While a separate backup job could be run to make sure the email data is retained to this standard, it becomes difficult to verify that this is actually happening, and the chance for error increases.
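To make the difference concrete, a retention rule can be written down per data class. The sketch below uses the one-year and seven-year figures from the example above; the class names and function names are hypothetical, and a real policy would come from your own governance or legal requirements.

```python
from datetime import date

# Retention periods in years, taken from the example in the text.
RETENTION_YEARS = {"backup": 1, "email": 7}


def expiry_date(created, data_class):
    """Return the first date on which the data may be deleted."""
    years = RETENTION_YEARS[data_class]
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # Created on Feb. 29 and the target year is not a leap year.
        return created.replace(year=created.year + years, day=28)


def is_expired(created, data_class, today):
    """True once the retention period for this data class has passed."""
    return today >= expiry_date(created, data_class)
```

With one rule per system, each system enforces a single retention period instead of relying on special-case backup jobs.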
The older this retained data becomes, the less likely it will be readable by your current backup application. Many organizations have switched backup applications in the last seven to 10 years; there's no reason to think that it won't happen again. The simpler the creation of the archive -- for example, a simple file move -- the easier it will be to recover that data in the future. This again supports the need to maintain the archive in the native format wherever possible.
The amount of data that needs to be retained will drive the need for a separate archive as well. Archive data will be the fastest-growing data set in many enterprises. That growth alone may push the archive onto its own platform. In five years, a 3 PB archive may be considered normal for even a smaller shop.
While tape can clearly scale to almost infinite levels, searching for and finding this data is more challenging when it's scattered across thousands of tape cartridges. A search engine, provided by the archive software, that can search across tapes in multiple formats will be a required capability.
Searchability: Searchability is also more critical in archives. In backup, you often know the file name, where the prior version of the file is and you may even know what piece of media it's on. With an archive, you may be trying to find a needle that was placed in the haystack six or seven years ago. You may not know anything about the file; you will probably be searching for a keyword like a case number. This is going to require a fast search capability that can find files based on content.
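Content-based search of this kind typically rests on an index built ahead of time. The following is a minimal sketch of an inverted index, assuming plain-text documents already loaded into memory; the tokenizer, file names and the case number are invented for illustration, and a production search engine would do far more.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9][a-z0-9-]*")


def build_index(documents):
    """Map each lowercase token to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in TOKEN.findall(text.lower()):
            index[token].add(name)
    return index


def search(index, query):
    """Return the documents that contain every token in the query."""
    tokens = TOKEN.findall(query.lower())
    if not tokens:
        return set()
    # .get() avoids adding empty entries to the defaultdict on a miss.
    return set.intersection(*(index.get(t, set()) for t in tokens))


documents = {
    "memo_2009.txt": "Re: case 2009-CV-1187 settlement terms",
    "notes.txt": "Lunch menu for Friday",
    "email_0412.eml": "Attached filings for Case 2009-cv-1187",
}
index = build_index(documents)
hits = search(index, "2009-CV-1187")
```

Here the search finds both files that mention the case number without knowing their names, locations or dates.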
The metadata created by this search engine is going to be large; some projections are as much as 5% of the size of the store (roughly a 150 TB metadata database on a 3 PB archive). By separating backup data from the archive, you're keeping it from being indexed by the search engine, and keeping the growth of the search engine's metadata database in check.
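The sizing arithmetic is easy to check. This small sketch assumes decimal units (1 PB = 1,000 TB) and treats the 5% figure as a projection, not a measured value; the function name is hypothetical.

```python
def index_size_tb(archive_pb, index_fraction=0.05):
    """Projected search-index size in TB for an archive sized in PB."""
    return archive_pb * 1000 * index_fraction


projected = index_size_tb(3)  # 5% of a 3 PB archive
```

At 5%, a 3 PB archive implies an index of about 150 TB, which is itself a substantial storage system to plan for.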
Regulation requirements: Retention requirements dictated by either corporate governance or by specific legal regulations will also drive archives out of the backup process. These regulations may require that data be stored in a write once, read many (WORM) format to prove a chain of custody and ensure that data hasn't changed.
WORM tape media is available for most formats, but the decision about what type of media to use must be made from the beginning. A disk archive allows a volume to be made WORM at any time. Some solutions even allow the "lock" to expire after a legal requirement has been satisfied. This is a critical flexibility because of the ever-changing nature of legal regulations.
Recovery requirements: The last requirement -- and maybe the most surprising until you think about it -- is recoverability. The need to accurately recover data the first time is actually more critical with archive data than with backups. When an archive is designed correctly, it's most likely the last remaining copy of a piece of data.
A recovery from backup is usually a recovery of data that has been accidentally deleted or corrupted. If recovery from last night's backup doesn't work, you can always fall back to the day before. In the event of a total recovery failure, you will have to tell the user they have to re-key the data. While this won't make you popular, no one ends up in jail.
A recovery attempt from an archive is usually done for a specific reason, often a legal one, with fines attached if the attempt isn't successful. The archive recovery has to work; no recreation of data may be allowed. A disk-based archive system that offers data deduplication can leverage the same algorithms it uses to detect similar data to also perform a data integrity check. By running these algorithms against stored data, the system can verify that there has been no degradation and that the data will be recoverable when needed. If the media (disk drive) degrades, the system can move the data to another disk location prior to data loss.
There are enough differences between data backup and archive that a best practice is to separate them in all but the smallest of businesses. The complexity of mixing the two processes and platforms, managing multiple retention strategies and achieving different recovery objectives will make any perceived cost advantages of combining the two evaporate quickly.
About the author: George Crump, founder of Storage Switzerland, is an independent storage analyst with over 25 years of experience in the storage industry.