What is the difference between archives and backups?

Backups are primarily used for operational recoveries, to quickly recover an overwritten file or corrupted database. The focus is on speed, both to back up and recover, and on data integrity. Archives, on the other hand, typically store a version of a file that's no longer changing, or shouldn't be changing.

Speed is less important in archives; even in the event of a legal action, you typically have a few days to respond. Searchability is more critical in archives. In addition, importance is placed on the ability to scale data integrity and data retention over a long period of time, possibly decades. An archive is no longer limited to traditional files and images; most database applications have specific archive capabilities to allow the primary database to stay lean and fast while the archive is retained for research and compliance.

Email archiving applications are often the catalyst for establishing a separate archive process. It's important to realize that you are legally responsible to do more than just capture email.

When considering combining archive and backup onto a single platform, the decision will depend on the specific platform, what the organization's retention requirements are, and the expected goals of the backup and archive process.

Can tapes be used for archives?

While the vast majority of organizations consider tape for their long-term archives, and companies like Index Engines Inc. provide the ability to more effectively search for data on tape, there's a risk in counting on tape for the storage of archive data.

Just as disk has become a popular addition to the backup process because of the concerns about recovery from tape, data that's archived to tape should be considered just as vulnerable. It's difficult to develop an ongoing process to verify the integrity of tape, leading to greater concerns the longer the media sits on a shelf. There's also a simple technology issue. Even if your retention requirements are only seven years, think back seven years ago -- LTO-1 or LTO-2 was becoming the standard, DLT in the Super DLT form was still considered competition. What's the likelihood that the LTO-1 tape that's been sitting on the shelf can be read and successfully restored from in the new LTO-4 drives in your data center? Anecdotal stories of sub-50% success rates aren't uncommon.

Even if the hardware works, how are you going to find a piece of data that's seven years old from hundreds -- and possibly thousands -- of tapes? Most backup applications don't keep their metadata (the data about the data being backed up) very long. In fact, the average length of time is approximately 90 days to 120 days. After that, it's up to your records-retention skills or the person who had your job before you, or even the person before them. Recovery of data this old is most likely going to require guesswork, plenty of manual scans and lots of time.

Should you use disk for archives?

The thought of keeping all archives on disk may seem impossible and costly, but companies like EMC Corp., Hewlett-Packard Co. and Permabit Technology Corp. are delivering technology today to make a disk archive that can last for 25, 50 or 100 years a reality. But, the disk drive you start with today won't be the same disk drive that you use 100 years from now (if we still even use disk drives 100 years from now).

While disk also makes combining the process of backup and recovery into a single platform more realistic than tape, a best practice is to have a specific system for archives. Archives have different retention requirements, different recovery needs and different searchability requirements than backups.

Most disk-based archive systems present themselves as a network mount point, which makes access over time realistic. Unlike with a seven-year-old tape drive, you access a CIFS or NFS mount today in almost exactly the same way you did seven years ago.

Many archive applications -- and especially file virtualization technologies -- have the ability to move data in its native form, eliminating the need for proprietary software to access it. Because disk archives are merely a network mount point, an organization can simply move qualifying data to it.
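Because the archive is just a mount point, that "simple move" of qualifying data can be scripted directly. The following is a minimal sketch, assuming a hypothetical one-year age policy; the directory names are stand-ins, and the demo uses temporary directories in place of a real NFS/CIFS archive mount:

```python
import os
import shutil
import tempfile
import time

ARCHIVE_AGE_DAYS = 365  # hypothetical policy: archive files untouched for a year

def archive_old_files(source_dir, archive_mount, age_days=ARCHIVE_AGE_DAYS):
    """Move files older than `age_days` to the archive mount, in native form."""
    cutoff = time.time() - age_days * 86400
    moved = []
    for name in os.listdir(source_dir):
        path = os.path.join(source_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            shutil.move(path, os.path.join(archive_mount, name))
            moved.append(name)
    return moved

# Demo: temporary directories stand in for the primary share and the archive mount.
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
with open(os.path.join(src, "old_report.doc"), "w") as f:
    f.write("stale")
os.utime(os.path.join(src, "old_report.doc"), (0, 0))  # backdate mtime to 1970
with open(os.path.join(src, "fresh.doc"), "w") as f:
    f.write("active")

moved = archive_old_files(src, dst)
```

Because the files land on the archive in their native format, no proprietary software is needed to read them back later -- the point made above.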

Storing data in an easy-to-access form is going to be critical in the future. Backup applications don't offer this kind of access; they typically store data in some sort of proprietary backup format even if it's storing that data on disk. A tape library -- even if it's accessed by an archive application -- requires some type of proprietary programming to move tape media from cartridge slots to tape drives.

Disk archives can achieve the cost and power efficiency of tape through the use of technology like data deduplication and MAID. Data deduplication rates for archive data aren't as efficient as backup data deduplication because a backup will repeatedly process essentially the same or similar data, leading to higher deduplication rates (i.e., this week's full backup is very similar to last week's full backup). Archives, on the other hand, tend to be unique files with less similarity. These differences in data similarity rates, along with the earlier mentioned retention and search differences, are another reason to keep the archive data and backup data on two separate systems. Keeping them separate makes it easier to project an accurate level of your data deduplication rate.
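The difference in deduplication rates is easy to see with a toy model. The sketch below uses fixed 4 KB chunks and SHA-256 fingerprints -- an illustrative simplification; real products typically use variable-length chunking and will show different absolute ratios:

```python
import hashlib
import os

def dedup_ratio(datasets):
    """Ratio of logical bytes stored to unique-chunk bytes, fixed 4 KB chunks."""
    chunk = 4096
    logical = 0
    unique = {}  # chunk fingerprint -> chunk size
    for data in datasets:
        logical += len(data)
        for i in range(0, len(data), chunk):
            piece = data[i:i + chunk]
            unique[hashlib.sha256(piece).hexdigest()] = len(piece)
    return logical / sum(unique.values())

# Backup-like data: this week's full is nearly identical to last week's.
full_1 = b"A" * 40960
full_2 = full_1[:-4096] + b"B" * 4096      # only one chunk changed
backup_ratio = dedup_ratio([full_1, full_2])

# Archive-like data: unique files with essentially no shared content.
archive_ratio = dedup_ratio([os.urandom(40960) for _ in range(5)])
```

The repetitive "weekly full" data collapses to a handful of unique chunks, while the unique archive files barely deduplicate at all -- which is why projecting a single deduplication rate across a combined platform is so unreliable.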

The differences in the archive and backup processes

Data retention: With the basics like cost efficiencies, ease of access and data integrity addressed, the final decision to separate the two systems should be made on the length of time the data needs to be retained, how much of that data there is and if there are any specific legal requirements to retain that information.

There's typically a dramatic difference in how long data needs to be retained. For example, many organizations only have a requirement to retain backup data for one year from when the backup happened, but may have a requirement to retain all email data for seven years. While a separate backup job could be run to make sure the email data is retained to this standard, it becomes difficult to ensure that that's happening and the chance for error increases.
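One way to keep divergent retention rules from depending on error-prone, separately maintained backup jobs is to encode them as explicit policy. A minimal sketch with hypothetical retention classes -- the one-year and seven-year figures mirror the example above, but real requirements vary by industry and jurisdiction:

```python
from datetime import date, timedelta

# Hypothetical per-class retention rules, in years.
RETENTION_YEARS = {"backup": 1, "email_archive": 7}

def retain_until(data_class, created):
    """Date before which data of this class must not be expired."""
    return created + timedelta(days=365 * RETENTION_YEARS[data_class])

backup_expiry = retain_until("backup", date(2008, 5, 1))
email_expiry = retain_until("email_archive", date(2008, 5, 1))
```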

The older this retained data becomes, the less likely it will be readable by your current backup application. Many organizations have switched backup applications in the last seven years to 10 years; there's no reason to think that it won't happen again. The simpler the creation of the archive -- for example, a simple file move -- the easier it will be to recover that data in the future. This again supports the need to maintain the archive in the native format wherever possible.

The amount of data that will need to be retained will drive the need for a separate archive as well. Archive data will be the fastest-growing data set in many enterprises, and that growth alone may push the archive onto a platform of its own. In five years, a 3 PB archive may be considered normal even for a smaller shop.

While tape can clearly scale to almost infinite levels, searching for and finding data is far more challenging when it's scattered across thousands of tape cartridges. A software search engine that can index tape and search across multiple tape formats becomes a required capability.

Searchability: Searchability is also more critical in archives. In backup, you often know the file name, where the prior version of the file is and you may even know what piece of media it's on. With an archive, you may be trying to find a needle that was placed in the haystack six or seven years ago. You may not know anything about the file; you'll probably be searching for a keyword such as a case number. This is going to require a fast search capability that can find files based on content.
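A content search of that kind boils down to an inverted index: a map from each keyword to the files that contain it. A minimal sketch with made-up file names -- real archive search engines add ranking, stemming and massive scale, but the core idea is the same:

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each keyword to the set of archived files containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index

# Hypothetical archived documents keyed by file name.
archive = {
    "memo_2017_0412.txt": "Re: case 2017-HR-0042, settlement terms attached",
    "board_minutes.txt": "Quarterly results discussed; no legal matters",
}
index = build_index(archive)
hits = index.get("0042", set())  # search by case number fragment
```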

The metadata created by this search engine is going to be large; some projections put it at as much as 5% of the size of the store (a 150 TB metadata database on a 3 PB archive). By separating backup data from the archive, you keep the backup data from being indexed by the search engine, and keep the growth of the search engine's metadata database in check.

Regulation requirements: Retention requirements dictated by either corporate governance or by specific legal regulations will also drive archives out of the backup process. These regulations may require that data be stored in a write once, read many (WORM) format to prove a chain of custody and ensure that data hasn't changed. WORM tape media is available for most formats, but the decision about which type of media to use must be made from the beginning. A disk archive allows a volume to be made WORM at any time. Some solutions even allow the "lock" to expire after a legal requirement has been satisfied -- critical flexibility, given the ever-changing nature of legal regulations.
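The lock-then-expire behavior can be illustrated with ordinary file permissions. This is only a sketch of the idea -- real WORM enforcement happens in the storage layer, not in a script -- with a hypothetical retention timestamp:

```python
import os
import stat
import tempfile
import time

def worm_lock(path, retain_until):
    """Sketch of a software WORM lock: drop write bits, record an expiry."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return {"path": path, "retain_until": retain_until}

def worm_release(lock):
    """Let the lock expire once the retention requirement has been satisfied."""
    if time.time() >= lock["retain_until"]:
        os.chmod(lock["path"], stat.S_IRUSR | stat.S_IWUSR)
        return True
    return False

# Demo: a temporary file stands in for an archived volume; the hold
# below is already expired so the release succeeds immediately.
fd, path = tempfile.mkstemp()
os.close(fd)
lock = worm_lock(path, retain_until=time.time() - 1)
writable_again = worm_release(lock)
```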

Recovery requirements: The last -- and maybe the most surprising requirement until you think about it -- is recoverability. The need to accurately recover data the first time is actually more critical with archive data than with backups. When an archive is designed correctly, it's most likely the last remaining copy of a piece of data.

A recovery from backup is usually a recovery of data that has been accidentally deleted or corrupted. If recovery from last night's backup doesn't work, you can always fall back to the day before. In the event of a total recovery failure, you'll have to tell the user to re-key the data. While this won't make you popular, no one ends up in jail.

A recovery attempt from an archive is usually done for a specific reason, often a legal one, with fines attached if the attempt isn't successful. The archive recovery has to work; recreating the data may not be an option. A disk-based archive system that offers data deduplication can leverage the same hashing algorithms it uses to detect duplicate data to also perform data integrity checks. By periodically running these algorithms against stored data, the system can verify that there has been no degradation and that the data will be recoverable when needed. If the media (a disk drive) degrades, the system can move the data to another disk location before data is lost.
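That verify-by-fingerprint idea can be sketched in a few lines: record a hash at ingest, then periodically re-hash and compare (often called "scrubbing"). The class and method names below are made up for illustration:

```python
import hashlib

class IntegrityStore:
    """Sketch: keep a fingerprint per stored object, then periodically
    re-hash and compare to detect silent degradation (bit rot)."""

    def __init__(self):
        self.objects = {}   # object id -> data (stands in for disk blocks)
        self.digests = {}   # object id -> SHA-256 recorded at ingest

    def put(self, obj_id, data):
        self.objects[obj_id] = data
        self.digests[obj_id] = hashlib.sha256(data).hexdigest()

    def scrub(self):
        """Return the ids whose current contents no longer match ingest."""
        return [oid for oid, data in self.objects.items()
                if hashlib.sha256(data).hexdigest() != self.digests[oid]]

store = IntegrityStore()
store.put("contract.pdf", b"original signed contract")
store.put("memo.txt", b"routine memo")
store.objects["memo.txt"] = b"routine memo\x00"  # simulate media degradation
damaged = store.scrub()
```

A real system would react to a scrub failure by rebuilding the object from a replica on healthy media before the last good copy is lost.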

There are enough differences between data backup and archive that a best practice is to separate them in all but the smallest of businesses. The complexity of mixing the two processes and platforms, managing multiple retention strategies and achieving different recovery objectives will make any perceived cost advantages of combining the two evaporate quickly.

About the author: George Crump, founder of Storage Switzerland, is an independent storage analyst with over 25 years of experience in the storage industry.

This was first published in May 2008
