Much of the buzz in the secondary storage market has been around data deduplication in disk-based backup schemes. The benefits of data deduplication are significant. It allows you to either retain backup data on disk for longer periods of time or extend disk-based backup strategies to other tiers of applications in your environment. Either strategy means recovery times can be greatly improved (over tape-based backup) for a larger share of the data in your environment. The capacity reduction achieved through data deduplication also reduces network traffic -- depending on where the deduplication occurs, it can cut the volume of data transferred over a LAN, SAN or WAN -- making it more practical for organizations to consolidate backup for remote and branch offices and to replicate offsite for disaster recovery protection. Both scenarios are significant improvements over tape-based strategies, where physical media must be handled and transported between sites.
There's been a lot more focus (and vendor marketing) on the data deduplication process for backup -- specifically, when, where, how and to what degree deduplication impacts the process of writing data. However, that focus isn't accompanied by increased enlightenment around how deduplication affects the recovery process -- specifically, how quickly you can recall data for restoration.
During the recovery process, the requested data may not reside in contiguous blocks on disk -- even with non-deduplicated backup. As backup data expires and storage space is freed, fragmentation can occur, which may increase recovery time. The same concept applies to deduplicated data: the unique data -- and the pointers to it -- may be stored non-sequentially, slowing recovery performance.
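To make the pointer mechanics concrete, here is a toy sketch of a deduplicating store. It is an illustration only, not any vendor's implementation: it uses tiny fixed-size chunks and SHA-256 fingerprints (real products typically use kilobyte-scale, often variable-size chunks). Each backup is kept as an ordered list of fingerprints, so a restore must chase every pointer back to chunks that may sit anywhere on disk:

```python
import hashlib

CHUNK_SIZE = 4  # tiny chunks for illustration; real systems use KB-sized chunks


class DedupStore:
    """Toy deduplicating store: unique chunks keyed by fingerprint,
    backups kept as ordered lists of fingerprints (pointers)."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes (may land anywhere on disk)
        self.backups = {}  # backup name -> ordered list of fingerprints

    def backup(self, name, data):
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # store only previously unseen chunks
            pointers.append(fp)
        self.backups[name] = pointers

    def restore(self, name):
        # Recovery must resolve every pointer; chunks shared across backups
        # are not stored contiguously -- the fragmentation concern above.
        return b"".join(self.chunks[fp] for fp in self.backups[name])


store = DedupStore()
store.backup("mon", b"AAAABBBBCCCC")
store.backup("tue", b"AAAABBBBDDDD")   # first two chunks already stored
print(len(store.chunks))               # 4 unique chunks stored, not 6
print(store.restore("tue") == b"AAAABBBBDDDD")
```

Note that the second backup writes only one new chunk, which is where the capacity savings come from -- but its restore still touches chunks written at two different times, which is why restore performance degrades as pointer chains accumulate.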
Some backup and storage systems vendors that offer deduplication features anticipated these recovery performance issues and optimized their products to mask the disk fragmentation problem. Some vendors, such as ExaGrid Systems Inc. and Sepaton Inc., keep a copy of the most recent backup in its whole form, enabling more rapid restores of the most recently protected data than solutions that must reconstitute data from days, weeks or months of pointers. Other solutions are architected to distribute the data deduplication workload during backup, and the reassembly activity during recovery, across multiple deduplication engines to speed processing. This is the case with both software- and hardware-based approaches. Vendors that spread deduplication activities across multiple nodes and, importantly, allow additional nodes to be added, may provide better performance scalability than those with a single ingest/processing point.
Performance is dependent on several factors, including the backup software, network bandwidth, disk type and more. The time it takes for a single file restore will differ greatly from that of a full restore. It will, therefore, be important to test how a deduplication engine performs in several recovery scenarios, especially for data stored over a longer period of time, to judge the potential impact of deduplication in your environment.
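A recovery test plan can be as simple as timing each scenario the same way and comparing averages. The sketch below is a minimal harness; the two restore functions are hypothetical stand-ins, and in a real test you would replace them with calls to your backup software's restore API or command line:

```python
import time


# Hypothetical stand-ins for the restore operations your backup software
# exposes; substitute real API or CLI invocations in an actual test.
def restore_single_file():
    return b"x" * 1024             # pretend single-file payload


def restore_full_backup():
    return b"x" * 1024 * 1024      # pretend full-restore payload


def time_scenario(label, restore_fn, runs=3):
    """Average wall-clock time over several runs to smooth out cache effects."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        restore_fn()
        total += time.perf_counter() - start
    avg = total / runs
    print(f"{label}: {avg:.6f}s average over {runs} runs")
    return avg


time_scenario("single file, recent backup", restore_single_file)
time_scenario("full restore, recent backup", restore_full_backup)
# Repeat against your oldest retained backup to see how pointer chains
# accumulated over weeks or months affect recovery time.
```

Running the same scenarios against both a recent and a long-retained backup is the key comparison, since that difference is where deduplication-related fragmentation shows up.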
About this author: Lauren Whitehouse is an analyst with Enterprise Strategy Group and covers data protection technologies. Lauren is a 20-plus-year veteran in the software industry, formerly serving in marketing and software development roles.