Have you ever wandered through the files on one of your NAS servers? The typical NAS box is a mess of duplicate files, near-duplicates and old data from long-gone employees. It's the epitome of the modern data retention issue. The amount of stored data continues to increase at a dramatic rate, and backup data is no different. Luckily, putting a data retention policy in place is one way to reduce the size of backups and automate the retention process.
Generally, a manual cleanup process is not realistic. Large numbers of unnecessary files are very much "out of sight, out of mind" and, frankly, no admin wants the garbage duty. The problem is magnified with backup data. Many IT shops do a mix of incremental backups and regular full system backups.
These two forms of backup serve different purposes. Incremental backups are fast backups of any changed or added data, typically run daily, and they protect the latest state of the system. Rebuilding from these backups is time-consuming and slightly more error-prone than we would like to admit, so once a week a full backup puts a stake in the ground with a complete, coherent single copy of the system.
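To see why rebuilding from incrementals takes time, here is a minimal Python sketch of the restore chain: the most recent full backup plus every incremental taken after it. The `Backup` record and `restore_chain` function are hypothetical names used for illustration, not part of any particular backup product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Backup:
    day: date
    kind: str  # "full" or "incremental" (illustrative labels)


def restore_chain(backups, target):
    """Return the backups needed to rebuild the system as of `target`.

    The chain is the latest full backup on or before `target`,
    followed by every later incremental up to `target`, in date order.
    """
    usable = sorted((b for b in backups if b.day <= target), key=lambda b: b.day)
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup on or before the target date")
    return usable[full_positions[-1]:]


# Example schedule: weekly fulls with daily incrementals in between.
schedule = (
    [Backup(date(2024, 1, 7), "full")]
    + [Backup(date(2024, 1, d), "incremental") for d in range(8, 13)]
    + [Backup(date(2024, 1, 14), "full")]
)
chain = restore_chain(schedule, date(2024, 1, 11))
```

The longer the chain, the more pieces have to be read and applied correctly, which is the practical reason the weekly full backup exists.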
Data protection between the backups is a common concern, so shops either shrink the interval between backups until it approaches continuous backup, where altered files are queued for copying as they change, or they look to replication approaches that keep multiple copies active. The industry today sits on a spectrum between these approaches, especially with backup and disaster recovery to the cloud as an option.
In fact, one problem is that it is possible to over-protect data by creating too many alternative copies, making verification of recovery a challenge.
In all of this complexity, deciding when a file is dropped from the backup is very complicated. It isn't just a matter of when to toss the oldest incremental backups or create a fresh baseline for snapshots or continuous increments. Regulated sectors such as healthcare and finance also have different legal obligations for when data must go and what must be kept.
Policy is the way to go
The answer to this data storage dilemma is crafting a backup data retention policy, where the data storage system can protect or remove data based on pre-defined rules. Using extended metadata tagged to each data object, these systems can schedule deletion and designate where data is to be stored and how it is to be encrypted.
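As a rough illustration of metadata-driven deletion, the Python sketch below tags each object with a retention period and an optional legal hold, then sweeps for objects that are due for removal. The field names (`retention_days`, `legal_hold`) and the `sweep` function are assumptions made for this example; real storage systems expose their own metadata schemas.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class StoredObject:
    key: str
    created: date
    retention_days: int       # hypothetical metadata tag set by policy
    legal_hold: bool = False  # hypothetical tag: never delete while True


def expired(obj, today):
    """An object is due for deletion once its retention period has passed,
    unless a legal hold overrides the schedule."""
    if obj.legal_hold:
        return False
    return today >= obj.created + timedelta(days=obj.retention_days)


def sweep(objects, today):
    """Return the keys a scheduled cleanup job would delete today."""
    return [o.key for o in objects if expired(o, today)]


catalog = [
    StoredObject("reports/q4.xlsx", date(2024, 1, 1), retention_days=30),
    StoredObject("claims/1234.pdf", date(2024, 1, 1), retention_days=30, legal_hold=True),
    StoredObject("logs/march.txt", date(2024, 3, 1), retention_days=30),
]
to_delete = sweep(catalog, date(2024, 3, 15))
```

The point of the design is that the rule travels with the object, so no admin has to remember which folders are safe to purge.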
Most of the systems in question use object storage for data. Architecturally, object stores handle the thorny issues of protecting objects against drive failures by replicating them over a number of storage nodes and drives, even to the extent of making geographically distant copies to avoid natural disasters.
A good backup system based on object storage will remove duplicate files by computing a hash of each object and comparing it to the hashes of other objects, resolving the questions of which copy is valid and whether all of the copies are identical. The deduplication process can reduce the size of backed-up data and is a real money-saver.
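A minimal sketch of hash-based deduplication, assuming whole-file hashing with SHA-256 from Python's standard `hashlib` module. The `DedupStore` class is a hypothetical in-memory stand-in for an object store; production systems typically work on chunks and persist their indexes.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self):
        self.blobs = {}  # digest -> content bytes
        self.index = {}  # file name -> digest

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.blobs.setdefault(digest, data)
        self.index[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.index[name]]

    def stored_bytes(self):
        return sum(len(b) for b in self.blobs.values())


store = DedupStore()
first = store.put("alice/budget.doc", b"hello")
second = store.put("bob/budget-copy.doc", b"hello")  # duplicate content
store.put("carol/notes.doc", b"world!")
```

Both names for the duplicate file resolve to the same digest, so the store holds 11 bytes rather than the 16 that were written.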
Note that this type of deduplication isn't feasible with backup packages that create very large "image" files, because the chance of finding exact duplicates goes way down. Alternatives such as compression can achieve size reduction, but I would recommend against compressing extremely large files, since retrieving any particular object from them can be slow and decompression is more prone to errors.
How long do you keep your data?
Other than data constrained by legal rules, how long data should be held is an open question. At one extreme, we have data flowing from sensors, such as cameras in retail stores. These can give useful pointers on how to upsell a particular customer, but that type of information is very ephemeral, with a shelf life of perhaps 30 minutes.
At the other extreme, we have engineering designs that companies need to keep for 15 or more years, until the last machine ceases to have any service obligations. The best approach across this range is to use a policy engine that lets a backup data retention policy be set in as many different ways as possible, such as per user, per department, per folder, per file type, per project and so on.
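One way to picture such a policy engine is as a list of rules scoped by department, folder or file type, with the most specific matching rule winning. The Python sketch below is an illustration only: the `Rule` fields, the tie-break (prefer the longer retention when two rules are equally specific) and the example retention periods are all assumptions, not a description of any real product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    retention_days: int
    department: Optional[str] = None  # None means "any"
    folder: Optional[str] = None
    file_type: Optional[str] = None

    def _scopes(self):
        return (
            ("department", self.department),
            ("folder", self.folder),
            ("file_type", self.file_type),
        )

    def matches(self, meta):
        return all(want is None or meta.get(name) == want for name, want in self._scopes())

    def specificity(self):
        return sum(want is not None for _, want in self._scopes())


def retention_for(meta, rules, default_days):
    """Pick the most specific matching rule; on a tie, keep data longer."""
    candidates = [r for r in rules if r.matches(meta)]
    if not candidates:
        return default_days
    best = max(candidates, key=lambda r: (r.specificity(), r.retention_days))
    return best.retention_days


rules = [
    Rule(retention_days=15 * 365, department="engineering"),
    Rule(retention_days=7, file_type=".tmp"),
    Rule(retention_days=30, department="engineering", file_type=".tmp"),
]
design = {"department": "engineering", "folder": "/cad", "file_type": ".dwg"}
scratch = {"department": "engineering", "folder": "/cad", "file_type": ".tmp"}
memo = {"department": "sales", "folder": "/docs", "file_type": ".doc"}
```

Here the engineering drawing inherits the 15-year department rule, while the temporary file in the same department falls under the narrower 30-day rule.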
Where should backups go?
Today's preferred backup mode is to use public cloud storage. This is economical when compared with in-house backup, and it offers the advantage of effectively being "offline and offsite," which is critical in a world where hacking is the largest risk factor for data. Obviously, data in a public cloud must be encrypted, but it is critical that keys be retained by the data owner. Today, clouds have cool and cold storage tiers. A backup data retention policy can automatically move data between the two, saving money in the process.
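A tiering rule can be as simple as an age threshold. The Python sketch below plans moves between a cool and a cold tier based on days since last access; the 90-day cutoff, the tier labels and the function names are hypothetical, and a real policy would also weigh each provider's retrieval fees and minimum storage durations.

```python
from datetime import date


def choose_tier(last_access, today, cold_after_days=90):
    """Data untouched for `cold_after_days` or more belongs in the cold tier."""
    age_days = (today - last_access).days
    return "cold" if age_days >= cold_after_days else "cool"


def plan_moves(objects, today, cold_after_days=90):
    """objects maps key -> (current_tier, last_access).

    Returns key -> target tier for every object sitting in the wrong tier.
    """
    moves = {}
    for key, (tier, last_access) in objects.items():
        target = choose_tier(last_access, today, cold_after_days)
        if target != tier:
            moves[key] = target
    return moves


inventory = {
    "backup-2024-01.tar": ("cool", date(2024, 1, 1)),
    "backup-2024-05.tar": ("cool", date(2024, 5, 20)),
    "restored-set.tar": ("cold", date(2024, 5, 25)),
}
moves = plan_moves(inventory, date(2024, 6, 1))
```

Run on a schedule, a rule like this keeps older backups in the cheaper tier without anyone having to sort them by hand.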