The continuous data growth often referred to as the "data explosion" is old news to most enterprise data storage professionals. Most also agree that the constant development of new business applications, digital media services and social networking software multiplies the opportunities to generate massive amounts of data that will keep driving storage demands for years to come. And because it is impossible to prevent or even slow down the generation of data, many data storage environments have shifted their focus to data reduction for their data backup and recovery plan. Although the first methods of data reduction that come to mind are data compression and data deduplication, other data reduction solutions include single-instance storage (SIS), data archiving and data destruction/deletion.
In the late 1970s, compression algorithms were developed to address the growing demand for storing text files on disk; the Lempel-Ziv family of algorithms (LZ77 and LZ78, followed by the LZW variant in 1984) represents some of the early data reduction efforts. In the early 2000s, data deduplication emerged as the newest data reduction technology and became widely adopted within a few years. However, both technologies have limitations in terms of performance and capabilities depending on the type of data targeted.
For example, text or database data will typically yield very good ratios with lossless compression, which reproduces the original data exactly. Image and video files are more difficult to compress without losing some quality or resolution; accepting that loss is essentially what differentiates "lossy" from lossless compression. Data deduplication ratios will also vary widely, depending on the type of data targeted. For example, encrypted data typically produces poor results because of the randomizing effect of encryption. In addition, deduplication is still not favored for primary storage due to performance limitations and is better suited for secondary storage targets such as data backups and archives.
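A small experiment illustrates why data type matters so much. The sketch below is illustrative only: it uses Python's standard zlib module as a stand-in for a lossless compressor, random bytes from os.urandom as a stand-in for encrypted data, and a simple fixed-size chunk comparison as a stand-in for deduplication. The helper names, sample record and chunk size are assumptions made for this example and do not describe any particular product.

```python
import hashlib
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, 9))


def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Total fixed-size chunks divided by unique chunks (higher is better)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(chunk).digest() for chunk in chunks}
    return len(chunks) / len(unique)


# Repetitive records, loosely standing in for text or database data.
text_like = b"customer_id=1001;status=active;region=west\n" * 2000
# Random bytes stand in for encrypted data (an assumption for illustration).
random_like = os.urandom(len(text_like))

# Ten identical 4 KiB blocks, such as the same block captured by ten backups.
ten_copies = os.urandom(4096) * 10
# Ten blocks that each look random, as independently encrypted copies would.
ten_encrypted_like = os.urandom(4096 * 10)

print(f"compression, text-like data:   {compression_ratio(text_like):.1f}:1")
print(f"compression, random-like data: {compression_ratio(random_like):.2f}:1")
print(f"deduplication, identical blocks:      {dedup_ratio(ten_copies):.0f}:1")
print(f"deduplication, encrypted-like blocks: {dedup_ratio(ten_encrypted_like):.0f}:1")
```

The repetitive records shrink dramatically, while the random-looking data barely changes size and offers no duplicate chunks to eliminate. Real deduplication products typically use more sophisticated chunking than the fixed 4 KiB blocks assumed here.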
Next-generation data compression
There have been some major improvements in data compression technology, and companies such as Ocarina Networks and Storwize Inc. have found a way around the traditional performance limitations of CPU-intensive software by moving compression to an appliance that sits between host and primary disk. As an added bonus, compressed data can be deduplicated once it moves from primary disk to backup or archive storage media. However, this technology is still relatively new and currently limited to network-attached storage (NAS). Future releases will eventually address Fibre Channel (FC) and iSCSI, but will also likely need to become compatible with emerging network convergence technologies such as Fibre Channel over Ethernet (FCoE).
Other data reduction options
Unfortunately, outside of deduplication and compression, technology options for data reduction remain somewhat limited and, in some instances, do not necessarily reduce the amount of data stored. The remaining option is data deletion or disposition, which can also be supported by technology but requires a very human component known as a "policy." But let's first look at the other technology options for data reduction before getting into the disposition aspect.
SIS is a technology that looks for identical files in a particular data storage environment and when found, replaces all extra copies with pointers to a single file (single instance) that can be shared. A good example of this technology is a capability within Microsoft Exchange where an email attachment sent to 30 recipients is stored only once and referenced in multiple inboxes. This is transparent to end users as the attachment appears to be in each individual inbox; meanwhile, the resulting data reduction ratio for the attachment in our example is 30 to one. This data reduction method is effective mostly in data storage environments where users share a lot of identical data.
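The arithmetic behind the 30-to-one figure can be sketched in a few lines. The example below is a minimal illustration, not how Exchange or any SIS product is implemented: it identifies identical files by SHA-256 digest and compares the bytes consumed by all copies with the bytes needed if each distinct file were kept once. The function names, directory layout and file name are hypothetical, and the sketch only measures the potential saving; it does not replace any copies with pointers.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents; identical files share a digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def reduction_ratio(root: Path) -> float:
    """Bytes used by every copy divided by bytes used with one copy per digest."""
    logical_bytes = 0
    stored_sizes = {}  # digest -> size of the single retained instance
    for path in root.rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            logical_bytes += size
            stored_sizes[file_digest(path)] = size
    stored_bytes = sum(stored_sizes.values())
    return logical_bytes / stored_bytes if stored_bytes else 1.0


# Simulate one attachment delivered to 30 inboxes (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    attachment = b"quarterly report contents " * 1000
    for n in range(30):
        inbox = root / f"inbox_{n:02d}"
        inbox.mkdir()
        (inbox / "report.doc").write_bytes(attachment)
    ratio = reduction_ratio(root)

print(f"single-instance reduction ratio: {ratio:.0f}:1")
```

Because the comparison is made on whole files, a single changed byte in one copy would make that copy count as a separate instance, which is why SIS pays off mainly where users share truly identical data.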
Data archiving is often promoted as a data reduction option, but in reality, it often only migrates or moves data. Archiving tools can reduce the amount of data that needs to be managed or backed up on a daily basis by moving infrequently used or no longer needed data to different storage media or a different location. However, while archiving reduces the amount of production data, it does not necessarily reduce the overall amount of data: data that has been moved to tape or other storage media still exists and still consumes capacity. When data archiving is combined with technologies such as SIS, data deduplication and compression, then data reduction really starts taking place.
Data disposal or deletion
Data deletion is the only other viable data reduction option for environments where deduplication, compression and SIS do not meet the requirements. However, data deletion is by far the least popular option among storage professionals and business managers because of the legal implications relating to regulatory compliance, freedom of information, e-discovery, etc. Before going on a data deletion rampage, there are a few things to consider:
- Define a clear policy on what type of data can be stored on corporate servers. File servers are often used to store user data and many companies do not spend much time looking at what ends up on the user drives. Finding music, photos and movie files on storage arrays is not uncommon.
- Create an email retention policy and enforce it. One relatively easy way to handle this is to implement an email archive tool such as Symantec Corp.'s Enterprise Vault, which allows you to archive messages, perform discovery when needed and set a retention period after which archives can be deleted. This product also works for file systems and Microsoft SharePoint data. Other archive tools, such as those from Informatica Corp., are specifically designed for archiving database-driven applications such as CRM and ERP systems.
- Beware of PST files (or personal email archive files), especially when trying to enforce email disposal policies. Many users have discovered that they can save email messages to PST files before they are automatically archived and eventually deleted. This practice can undermine data reduction efforts, especially if users store these PST files on the corporate file server. Creation of PST files can also nullify a corporate email retention policy by allowing messages that are marked for deletion to remain accessible.
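Before enforcing any of these policies, it helps to know how much space the files in question actually occupy. The following is a rough sketch of such an audit, under stated assumptions: the FLAGGED mapping of file extensions to categories is a hypothetical policy invented for this example, the audit_share function name is illustrative, and the script only reports totals; it never deletes anything.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical policy: extensions and categories are illustrative only.
FLAGGED = {
    ".pst": "personal email archive",
    ".mp3": "music",
    ".jpg": "photo",
    ".mp4": "video",
}


def audit_share(root: Path) -> Counter:
    """Total bytes per flagged category found under root (read-only scan)."""
    totals = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            category = FLAGGED.get(path.suffix.lower())
            if category is not None:
                totals[category] += path.stat().st_size
    return totals


# Demonstrate against a throwaway directory standing in for a user share.
with tempfile.TemporaryDirectory() as tmp:
    share = Path(tmp)
    (share / "alice").mkdir()
    (share / "alice" / "old_mail.PST").write_bytes(b"\0" * 5000)
    (share / "alice" / "budget.xls").write_bytes(b"\0" * 1200)
    (share / "bob").mkdir()
    (share / "bob" / "holiday.jpg").write_bytes(b"\0" * 3000)
    (share / "bob" / "song.mp3").write_bytes(b"\0" * 4000)
    totals = audit_share(share)

for category, size in totals.most_common():
    print(f"{category}: {size} bytes")
```

A report like this gives the policy owner concrete numbers to discuss with users and legal counsel before any retention or disposal rule is applied.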
As discussed, the number of options available to reduce the amount of stored data is still limited. In environments where data does not lend itself to deduplication, or resides on storage such as Fibre Channel or iSCSI that next-generation compression does not yet support, the disposition of data might be the only viable alternative to achieve data reduction. However, the decision to dispose of data is not a trivial one and requires careful consideration and legal advice. The disposal of data also needs to be supported by a clear and enforceable policy. Assigning ownership of the policy and the overall process is therefore essential; as we have all learned, speed limits are meaningless without a traffic cop to enforce them.
About this author: Pierre Dorion is the data center practice director and a senior consultant with Long View Systems Inc. in Phoenix, Ariz., specializing in the areas of business continuity and DR planning services and corporate data protection.