Data deduplication technology today and tomorrow

It may seem like data deduplication for backup has been around forever, but there are plenty of companies that have yet to add dedupe to their backup operations.

Data deduplication is the process of eliminating redundant data by comparing new segments with segments already stored and keeping only one copy. The technology can significantly reduce required storage space, especially where redundancy is high. As a result, data deduplication has firmly established itself in the backup market. But not every data center uses deduplication. For example, Storage magazine's most recent Purchasing Intentions survey found that more than 60% of data centers haven't added data deduplication technology to their backup operations.
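The compare-and-keep-one-copy idea can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `DedupStore` class, the fixed segment size and the use of SHA-256 fingerprints are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy segment store that keeps one copy of each unique segment."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size  # assumed fixed segment size
        self.segments = {}                # fingerprint -> segment bytes
        self.logical_bytes = 0            # bytes clients have written

    def write(self, data):
        """Store data; return the fingerprint list needed to rebuild it."""
        recipe = []
        for i in range(0, len(data), self.segment_size):
            segment = data[i:i + self.segment_size]
            fp = hashlib.sha256(segment).hexdigest()
            # Keep the segment only if its fingerprint hasn't been seen.
            if fp not in self.segments:
                self.segments[fp] = segment
            recipe.append(fp)
        self.logical_bytes += len(data)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its fingerprint list."""
        return b"".join(self.segments[fp] for fp in recipe)

    def physical_bytes(self):
        """Bytes actually held after duplicates are eliminated."""
        return sum(len(s) for s in self.segments.values())

    def reduction_ratio(self):
        physical = self.physical_bytes()
        return self.logical_bytes / physical if physical else 0.0
```

Writing two backups that share most of their segments stores the shared segments once, and each backup can still be rebuilt from its own fingerprint list.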

Deduplication reluctance

The level of resistance to deduplication may come as a surprise to many in the storage industry. Because it appears to be a maturing technology and the term "deduplication" is so commonly used, it's easy to assume the technology is in use everywhere. The reality, as the survey shows, is that data deduplication is still an emerging technology with plenty of market left to be captured. This is good news for vendors that are still trying to enter or expand their presence in the purpose-built backup appliance (PBBA) market, and it's what's driving the next generation of deduplication devices.

Where data deduplication is today

Before looking at the latest developments in data deduplication, it makes sense to look at the current state of deduplication and to understand the reasons behind the resistance. While some backup applications have added deduplication capabilities, most companies begin to use the technology when it's hosted on some sort of backup appliance or PBBA. This appliance typically comprises three parts: software, hardware and storage capacity. Data sent to the device is analyzed by the deduplication software as it's received or after it's stored, so redundant data can be identified and eliminated.

This process highlights many of the reasons for the lack of deduplication traction. First, the data center must have enough data to make buying a PBBA realistic. With hard drive capacities now reaching 3 TB to 4 TB, a small server with four or five of those drives may provide all the backup capacity a smaller data center needs without having to resort to deduplication or the expense of a PBBA.

Second, deduplication only provides a return if there's redundant data being backed up. An increasing number of data centers are using some form of an incremental forever strategy like VMware's Changed Block Tracking (CBT). Not only does CBT reduce the amount of data transferred, it significantly reduces the amount of redundant data that would be stored a second time.
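A rough model shows why an incremental forever strategy leaves deduplication little to do. The sketch below is a simplification under stated assumptions: the function name is hypothetical, the "repeated fulls" case assumes one full backup per day, and every day's changed data is assumed to be unique.

```python
def reduction_ratio(full_tb, daily_change, days, incremental_forever):
    """Ratio of data sent to the backup target versus unique data kept.

    full_tb: size of one full backup, in TB
    daily_change: fraction of the data set that changes each day
    days: number of backup days retained
    """
    changed_tb = full_tb * daily_change * (days - 1)
    unique_tb = full_tb + changed_tb  # what a perfect dedupe engine keeps
    if incremental_forever:
        # One full, then only changed blocks: nothing redundant is sent.
        sent_tb = full_tb + changed_tb
    else:
        # A full backup every day resends mostly unchanged data.
        sent_tb = full_tb * days
    return sent_tb / unique_tb
```

With a 10 TB data set, a 2% daily change rate and 30 retained days, the repeated-fulls case works out to roughly 19:1, while the incremental forever case stays at 1:1 because no redundant data ever reaches the device.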

Third, a lack of trust remains a big area of concern for data centers. Most deduplication technologies have been vetted, but as PBBAs grow in capacity, reliability and performance problems can appear. The time it takes the deduplication engine to determine whether data is redundant affects performance, and an inaccurate identification can lead to extra capacity being consumed or, worse, to net new data not being properly stored.

The reliability of the system and the data it stores is a big concern since data deduplication is a technology that, by default, tries not to store data. A mistake could be catastrophic and many data centers still aren't ready to put their trust in the technology.

Performance problems typically stem from a deduplication system not being designed correctly. Deduplication lookups are a lot like traditional database lookups. The more deduplicated data a dedupe system stores, the more lookups need to occur, and as more lookups occur it takes longer for new data to be written to the system. For these reasons most deduplication vendors try to store as much of their index as possible in DRAM, which helps performance but can increase the price of the PBBA. Even with more DRAM, as the unit scales in capacity, eventually the PBBA will begin caching some of its lookup index to disk, which hurts performance.

Some vendors are turning to solid-state drives (SSDs) to augment DRAM and help with lookup performance. The problem is that SSDs, while they can improve performance, add to the overall cost of the system. And as overall system capacity scales, the index may eventually outgrow the SSD capacity as well, causing the system to once again cache to hard disk.
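Some back-of-the-envelope arithmetic shows how quickly an index can outgrow DRAM. The figures in this sketch are assumptions chosen for illustration (8 KiB average segments and 32 bytes per index entry), not numbers from any particular product, and the function names are hypothetical.

```python
def index_size_bytes(unique_bytes, segment_bytes=8 * 1024, entry_bytes=32):
    """Estimate index size: one entry per unique stored segment."""
    entries = -(-unique_bytes // segment_bytes)  # ceiling division
    return entries * entry_bytes


def index_spill_bytes(unique_bytes, dram_bytes, **index_args):
    """Portion of the index that no longer fits in DRAM."""
    return max(0, index_size_bytes(unique_bytes, **index_args) - dram_bytes)


TIB = 2 ** 40
GIB = 2 ** 30

# 100 TiB of unique data under the assumptions above needs a 400 GiB index.
print(index_size_bytes(100 * TIB) // GIB)              # 400
# With 256 GiB of DRAM, 144 GiB of that index lands on slower media.
print(index_spill_bytes(100 * TIB, 256 * GIB) // GIB)  # 144
```

Under these assumptions a 10 TiB system needs only a 40 GiB index, which fits comfortably in memory. That is why the problem tends to surface only as an appliance scales.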

Data dedupe in primary storage

Deduplication isn’t used just by backup appliances and backup apps; the technology is also finding its way into primary storage, but the requirements are different. Primary storage doesn't hold enough redundant data to deliver the high data reduction rate the backup process yields. Where backup deduplication usually produces an average 15:1 efficiency rate, primary storage may see a rate of 3:1 to 5:1, depending on the environment.

For hard disk drive (HDD)-based primary storage systems, this return rate may not be worth the investment since the cost per gigabyte (GB) of HDD storage is already low. Primary storage deduplication makes more sense in flash-based storage systems. While the efficiency rate of 3:1 to 5:1 is the same, the cost per GB of flash-based storage is 10 times or more that of an HDD system, so each gigabyte saved is worth far more. In addition, flash-based systems typically have performance to spare, so they can handle the overhead of deduplication better than HDD-based systems.
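The economics can be made concrete with a short calculation. The prices below are hypothetical units, with HDD at 1 and flash at 10 to mirror the 10-times gap described above, and the function names are illustrative.

```python
def effective_cost_per_gb(raw_cost_per_gb, reduction_ratio):
    """Cost per GB of data stored once deduplication is applied."""
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return raw_cost_per_gb / reduction_ratio


def savings_per_gb(raw_cost_per_gb, reduction_ratio):
    """Money saved per GB of data stored, compared with no deduplication."""
    return raw_cost_per_gb - effective_cost_per_gb(raw_cost_per_gb, reduction_ratio)


HDD_COST = 1.0     # hypothetical cost units per GB
FLASH_COST = 10.0  # assumed 10 times the HDD cost

for ratio in (3.0, 4.0, 5.0):
    print(ratio, savings_per_gb(HDD_COST, ratio), savings_per_gb(FLASH_COST, ratio))
```

At the same 4:1 ratio, deduplication saves 0.75 units per GB on HDD but 7.5 units per GB on flash, which leaves far more room to pay for the engineering and processing overhead.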

Scaling the deduplication lookup index is more critical in primary storage systems because lookup time directly impacts application performance and user experience. Vendors need to invest in ensuring that their deduplication algorithms are efficient because the deduplication technologies that were originally created for the backup process may not be viable for primary storage.

Over time, as flash-based storage becomes more commonplace, expect it to pull data deduplication adoption rates up, just as purpose-built backup appliances created the initial wave of deduplication adoption in the backup process. A key future challenge will be maintaining deduplicated data efficiency between primary and secondary tiers without re-inflating or un-deduplicating data as it moves between them.

Deduplication appliance alternatives

The PBBA deduplication market has also been impacted by the growing success of the appliance vendors' three biggest competitors: backup software, cloud storage and tape. It used to be that PBBA vendors and backup software vendors were the best of friends and a perfect complement to each other. Then backup software developers began to write their own deduplication code and add support for technologies such as VMware's CBT. That means backup software can be coupled with just about any standalone disk array and provide functionality similar to that of a PBBA.

There are limitations to the backup software approach. First, a user would have to commit fully to one software application to gain maximum deduplication efficiency. This rarely happens in larger data centers as multiple backup solutions are used daily. Second, while backup software vendors like to justify the expense of their deduplication modules by coupling them with the cheapest disk prices that can be found, an investment still has to be made in a reliable disk platform. The absolute cheapest disk won't be reliable enough for most data centers. Finally, there's the unpredictability of backup server CPU loading and memory utilization when backup and dedupe are combined.

Cloud storage is now a trusted backup target. Many vendors have solved the challenge of limited bandwidth to the cloud by offering on-premises appliances that act as a cache for backup data that will ultimately go to a cloud storage service. In many cases these systems store the local copy in its native, undeduplicated form; all subsequent copies are replicated to the cloud where they may or may not be deduplicated.

The cloud alternative brings a few additional features that are compelling. Cloud storage is a pay-as-you-grow type of arrangement where storage is paid for on a monthly basis, and users never have to experience the cost of a forklift upgrade. The second feature, which is growing in popularity, is the ability to start a server remotely, either on the on-premises appliance or in the cloud. This brings a new level of availability to businesses that may not have invested in it in the past.

While cloud is relatively new, the other alternative -- tape -- is relatively old, at least in data center terms. Tape technology is making a comeback as a new generation of IT professionals experience it for the first time. They're finding tape has matured, and is now faster and more reliable than previous generations. Tape has always enjoyed a cost-per-gigabyte advantage over disk -- a gap that deduplication reduced to some extent. But with recent updates, tape has once again widened that price delta and is by far the most cost-effective backup storage medium available.

The next generation of data deduplication

Data deduplication vendors are adding capabilities to their products to help increase the technology's adoption rate and to fight off challenges from alternatives like the cloud, tape and even regular disk in a server. PBBAs are evolving from just disk storage systems with deduplication software to truly complete data protection devices that can be integrated into applications and backup software for improved efficiency and management.

Improved accuracy. PBBA vendors are improving accuracy and efficiency by developing specific algorithms that understand how certain applications store data and how to best parse that data into segments that will correctly identify redundancy. This software integration also allows for certain applications or backup software to directly control interactions with the PBBA so that two separate processes no longer need to be run. The application or backup software can trigger a backup to the device and then control which subcomponents will be replicated to another device off-site.

Improved scalability. Some PBBA vendors are improving this integration further by leveraging their supportive software modules to make sure that some deduplication preflight checking of data is done prior to sending that data across the network to eliminate obviously redundant data. This spreads the data deduplication processing workload between the application server and the deduplication appliance, which should lessen the load on the appliance and resolve some of the scaling issues mentioned earlier.
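The preflight idea can be sketched as a fingerprint exchange: the application server hashes its segments, asks the appliance which ones it lacks and sends only those. The `Appliance` class and `backup` function here are hypothetical names for illustration and don't represent any vendor's protocol.

```python
import hashlib


class Appliance:
    """Stand-in for a deduplication appliance reachable over the network."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes

    def missing(self, fingerprints):
        """Return the fingerprints the appliance doesn't already hold."""
        return {fp for fp in fingerprints if fp not in self.segments}

    def store(self, fingerprint, segment):
        self.segments[fingerprint] = segment


def backup(data, appliance, segment_size=4096):
    """Run on the application server; returns (recipe, bytes_sent)."""
    segments = [data[i:i + segment_size]
                for i in range(0, len(data), segment_size)]
    # Hashing happens on the source, offloading work from the appliance.
    recipe = [hashlib.sha256(s).hexdigest() for s in segments]
    needed = appliance.missing(recipe)  # the preflight check
    bytes_sent = 0
    for fp, segment in zip(recipe, segments):
        if fp in needed:
            appliance.store(fp, segment)
            needed.discard(fp)  # don't resend a repeat within this backup
            bytes_sent += len(segment)
    return recipe, bytes_sent
```

In this sketch only the small fingerprints and the genuinely new segments cross the network, so a second backup of mostly unchanged data sends a fraction of what the first one did.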

Cloud and tape integration. Cloud and even tape support are also on appliance vendors' integration lists. Many deduplication appliances can already replicate to an identical appliance. Now vendors are adding the ability to replicate their data to a cloud service, saving customers the cost and maintenance of a second system. It also keeps cloud providers from having to develop their own hybrid appliances.

In similar fashion, tape is being integrated into these devices as either a spillover for additional capacity or to make an image copy of the PBBA in case of a disaster. The spillover integration is most interesting, as it slows the growth in the disk capacity of the appliance.

Global deduplication. As noted earlier, deduplication is only efficient if redundant data is sent to it. The efficiency is reduced as more individual appliances are deployed because they can't compare data segments with each other. Vendors are addressing this shortcoming by bringing a scale-out storage capability to deduplication where a single, global deduplication process runs across all the systems. This gives the system the greatest opportunity to identify redundant data while reducing costs.
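The difference between isolated appliances and a global process can be shown with a small comparison. This sketch assumes one possible design, routing each segment to a node by its fingerprint so identical segments always meet on the same node; the function names are hypothetical and real scale-out products may work differently.

```python
import hashlib


def fingerprint(segment):
    return hashlib.sha256(segment).hexdigest()


def stored_isolated(appliance_streams):
    """Each appliance dedupes only against the data sent to it."""
    total = 0
    for stream in appliance_streams:
        seen = {fingerprint(segment) for segment in stream}
        total += len(seen)
    return total


def stored_global(appliance_streams, node_count):
    """Route segments to nodes by fingerprint so duplicates collide."""
    nodes = [set() for _ in range(node_count)]
    for stream in appliance_streams:
        for segment in stream:
            fp = fingerprint(segment)
            nodes[int(fp, 16) % node_count].add(fp)
    return sum(len(node) for node in nodes)


site_a = [b"os-image", b"app-binaries", b"database-a"]
site_b = [b"os-image", b"app-binaries", b"database-b"]
print(stored_isolated([site_a, site_b]))   # 6 segments stored
print(stored_global([site_a, site_b], 2))  # 4 segments stored
```

The two isolated appliances each keep their own copy of the shared segments, while the global layout stores them once regardless of how many nodes the cluster has.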

Better dedupe algorithms. Data deduplication technology vendors are also fine-tuning their deduplication algorithms so redundancy checking can be done more efficiently. These vendors are learning that the typical first-in/first-out process common in a caching environment isn't appropriate for a deduplication appliance. Better algorithms lead to devices that create smaller indexes, use less RAM and allow the system to scale to higher capacities.

Virtualized PBBAs. Finally, some PBBA providers are dropping the physical appliance requirement and shifting to a software appliance option. This is the PBBA with deduplication delivered as a virtual machine, allowing it to integrate directly into the environment with no physical installation and no additional power and cooling requirements.

A virtualized PBBA creates many new possibilities for PBBA vendors. They can move into smaller markets because the purchase price of the solution is significantly less than that of hardware-based products. The virtual appliance can also be installed in the cloud so that users can replicate data to it. For larger enterprises, it makes branch-office deployment easier and less expensive since there's no physical hardware to implement and maintain.

Promising developments for data dedupe

Based on the current level of adoption, data deduplication for backup still has significant ground to gain in the data center. PBBAs with enhanced integration to existing applications will broaden deduplication's appeal by making the backup process less complicated. Virtualizing PBBAs and delivering them as virtual appliances will lower the cost of adoption and make them more affordable for smaller data centers. And integration with tape systems will make deduplication and PBBAs more cost effective for larger enterprises. Reducing complexity and cost are the keys to widespread adoption; as vendors continue to focus on these key areas, deduplication and deduplicating appliances will become the dominant first tier in data protection.

About the author:
George Crump is president of Storage Switzerland, an IT analyst firm focused on storage and virtualization.

