By W. Curtis Preston
Depending on your industry, you may actually have no data archiving requirements. Financial trading houses are required to archive all communications with customers. Anyone subject to the United States Health Insurance Portability and Accountability Act of 1996 (HIPAA) has serious long-term storage requirements, as they must store data beyond the life of the patient. Companies that are often subject to intellectual property/patent lawsuits may want to archive such material to prove that they did indeed come up with it first. The best thing to do here is to have meetings with the various stakeholders to ask them what their long-term storage requirements are. The legal department should also be present at each of these meetings to ensure that they run each request through the "how could this hurt us?" filter.
This is one of those areas where less is definitely more. I cannot state this any more plainly than "if you do not have an archiving requirement, don't archive." Keeping data longer than you need to can actually be more harmful than helpful -- especially if you live in the United States (the most litigious society in history). The Federal Rules of Civil Procedure state rather plainly that if you have it, you have to give it up. Imagine that you weren't required to store emails for several years, but you did it anyway. Even if your company has done nothing wrong, you have now created a tremendous burden on your storage department, which must supply this information to the plaintiff in the lawsuit. However, if your policy was to regularly purge this data, there would be nothing to supply.
Document your data retention policies
Once you've established what you're going to archive -- and more importantly what you're not going to archive -- you need to document your data retention practices. Document what you're keeping and not keeping, and what your data destruction policies are. For example, your policy could state that any data not subject to archiving requirements is kept for 180 days and then destroyed. Then you need to document adherence to this policy, which is more than just setting your backup retention settings to 180 days -- you need to delete the data. (Expired backup tapes are still discoverable; you will be required to scan them back into your backup system -- what a pain!)
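As a rough illustration of what a documented destruction rule might look like once it is written down precisely, here is a minimal Python sketch. The 180-day figure comes from the example above; the class name, field names and function names are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    """One line of a written retention policy (hypothetical structure)."""
    data_class: str
    retention_days: Optional[int]  # None = subject to an archiving requirement


# The example policy from the text: unregulated data is kept for 180 days.
DEFAULT_RULE = RetentionRule("unregulated", 180)


def destroy_on(created: date, rule: RetentionRule) -> Optional[date]:
    """Return the date the data must be destroyed, or None if it is archived."""
    if rule.retention_days is None:
        return None
    return created + timedelta(days=rule.retention_days)


def is_due_for_destruction(created: date, rule: RetentionRule, today: date) -> bool:
    """True once the destruction date has been reached."""
    due = destroy_on(created, rule)
    return due is not None and today >= due
```

Writing the rule as data rather than burying it in a backup setting makes it easier to show, later, exactly what the policy said on a given date.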
What I would do is write a script that looks for expired backup tapes and then relabels them. The first, last and only file on the tape will then be the ANSI label the backup software uses as an electronic label. Since it's the last file on the tape, just after it will be an end-of-data mark, which is impossible to get past with any tape drive or virtual tape drive. Therefore, while there is technically data on the rest of the tape, nothing can get to it. Translation: It's not discoverable. Document the policy, document the script, then document that you're using both by auditing the practice every so often and including the audit results with the documentation. You really want to have your ducks in a row if you want to go into court and say that your backups aren't discoverable after n days, but if you do what is suggested here, you should be fine.
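The shape of such a script might look like the following Python sketch. It is only an outline under stated assumptions: the tape catalog is passed in as a plain list, a tape counts as expired once its expiry date is on or before today, and the actual relabel operation is supplied by the caller as the `relabel` function, because every backup product has its own command for that. All names here are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Tape:
    barcode: str      # hypothetical tape identifier
    expires_on: date  # expiry date taken from the backup catalog


def expired_tapes(catalog: Iterable[Tape], today: date) -> List[Tape]:
    """Return tapes whose retention has ended (assumed: expiry date <= today)."""
    return [tape for tape in catalog if tape.expires_on <= today]


def relabel_expired(
    catalog: Iterable[Tape],
    today: date,
    relabel: Callable[[str], None],
) -> List[Dict[str, str]]:
    """Relabel every expired tape and return one audit record per tape.

    `relabel` stands in for the backup product's own relabel operation;
    it is a placeholder, not a real API.
    """
    records = []
    for tape in expired_tapes(catalog, today):
        relabel(tape.barcode)
        records.append({
            "barcode": tape.barcode,
            "expired_on": tape.expires_on.isoformat(),
            "relabeled_at": datetime.now(timezone.utc).isoformat(),
        })
    return records
```

The returned records are the raw material for the audit trail: appended to a log kept alongside the written policy, they show that the script actually ran and which tapes it touched.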
Choosing data archiving software
Now that you've determined what you're going to archive, you need to determine how you're going to archive it. Let's go back to the first sentence in this article -- you need to use actual data archiving software. What is data archive software? It is software that allows you to search by contexts other than just server, application/directory and file name/email. That is all standard data backup software can do: grab a known file or email out of a known directory/application from a known server at one point in time. That is the only context it knows. Archive software, on the other hand, needs to be able to grab a series of emails/files from a number of applications/directories and a number of servers, from a large range of time -- potentially up to seven years. Some offerings in this space include products such as Autonomy Zantaz, Iron Mountain/Mimosa NearPoint and Symantec Corp.'s Enterprise Vault. There are also many other niche players in the data archive software market.
The difference between data backup and data archive software can best be seen in the difference between a restore (what you do with backup software) and a retrieval (what you do with archive software).
- A restore request might ask: "Give me /home/curtis/thing.doc from elvis on 7/30/2010"
- A restore request might ask: "Give me a single email from Curtis with a subject line of Whatzamajigger from 7/25/2010"
- A retrieval request might ask: "Give me all files of any kind, that were created between 7/1/2007 and 7/1/2010, on any server, with the words 'project bilko' in them"
- A retrieval request might ask: "Give me all emails from Curtis to anyone outside the company written between 7/1/2003 and 7/1/2010 that contain the words 'promise' or 'guarantee'"
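To make the contrast between the two kinds of request concrete, here is a small Python sketch. It is a toy model, not how any backup or archive product is implemented: the `Item` record and both functions are hypothetical, and the server name "graceland" and the second and third sample files are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Item:
    server: str
    path: str
    day: date   # date the file was created / backed up
    text: str   # file contents (only a content index can search this)


def restore(items: Iterable[Item], server: str, path: str, day: date) -> Optional[Item]:
    """Backup-style lookup: one known file, one known server, one point in time."""
    for item in items:
        if (item.server, item.path, item.day) == (server, path, day):
            return item
    return None


def retrieve(items: Iterable[Item], start: date, end: date, phrase: str) -> List[Item]:
    """Archive-style search: any server, any path, a date range, matched on content."""
    needle = phrase.lower()
    return [i for i in items if start <= i.day <= end and needle in i.text.lower()]


# Hypothetical sample data.
SAMPLE = [
    Item("elvis", "/home/curtis/thing.doc", date(2010, 7, 30), "Notes on Project Bilko"),
    Item("graceland", "/srv/plan.txt", date(2008, 3, 2), "project bilko budget"),
    Item("elvis", "/home/curtis/memo.doc", date(2009, 5, 5), "Re: Project Bilko schedule"),
]
```

With this sample, `restore(SAMPLE, "elvis", "/home/curtis/thing.doc", date(2010, 7, 30))` needs every coordinate up front, while `retrieve(SAMPLE, date(2007, 7, 1), date(2010, 7, 1), "project bilko")` needs none of them and returns matches from both servers.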
Do you see how different these requests are? Can you imagine satisfying either of the last two requests with standard backup software? If you were unlucky enough to have weekly full backups of your email system for the last seven years, you would have to perform 364 (52 x 7) restores of your email system to extract the data that you need. In addition, backups of your email system can only be restored to the version of the email software you used at that time, which can only run on the version of the operating system you were using at that time, each of which had its own patch levels, etc. You do not want to go down this road. If you have an archiving requirement, you want to use data archiving software.
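The restore count above is simple arithmetic, and a short Python sketch makes it easy to plug in your own numbers. The function names are hypothetical, and the hours-per-restore figure is something you would have to measure in your own environment; no value for it is given here.

```python
def restores_needed(years: int, fulls_per_year: int = 52) -> int:
    """Number of full restores required if every weekly full must be searched."""
    return years * fulls_per_year


def total_restore_hours(restores: int, hours_per_restore: float) -> float:
    """Total effort, given a locally measured time per restore (an assumption)."""
    return restores * hours_per_restore
```

For the seven-year example in the text, `restores_needed(7)` gives the 364 restores quoted above.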
It is true that some backup software is starting to have archive retrieval capabilities, so you might want to talk to your backup software vendor before beginning your archiving search -- you may have what you need already! If not, however, you need to embark on a search for a proper email and/or filesystem archiving tool. Then remember to perform a full proof-of-concept test on any such tool before deploying it: They are not all created equal. Happy hunting.
About this author: W. Curtis Preston (a.k.a. "Mr. Backup"), executive editor and independent backup expert, has been singularly focused on data backup and recovery for more than 15 years. From starting as a backup admin at a $35 billion credit card company to being one of the most sought-after consultants, writers and speakers in this space, it's hard to find someone more focused on recovering lost data. He is the webmaster of BackupCentral.com, the author of hundreds of articles, and the author of the books "Backup and Recovery" and "Using SANs and NAS."
This was first published in August 2010