Published: 06 Jun 2016
The introduction of disk-based backup systems ushered in a revolution in data protection. Because disk is a random-access medium, backup to disk provided a faster and more flexible recovery process. There was no more waiting for tapes to be mounted, loaded and fast-forwarded to start a restore job, and there was no risk of contention between jobs for the same tape. Features such as data deduplication (first seen in backup products) made disk-based backups financially viable through savings of 20:1 or more in logical versus physical data stored.
Looking back at the use of disk and tape for data protection, we can see that tape didn't provide a great deal of flexibility to do more with backup contents. Accessing media was very much a serial process, making it difficult to retrieve data in a creative fashion. However, with data on disk, the ability to search and retrieve content became significantly easier.
Deduplication leads to huge savings
Deduplicating appliances from the likes of Data Domain and ExaGrid are storage systems specifically designed for the ingestion and storage of large volumes of sequentially written data, such as backups. The deduplication function significantly reduces the amount of physical data managed by the system compared to the logical volume of data written. Savings can be on the order of 95%, which is equivalent to a 20:1 reduction.
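The relationship between a reduction ratio and the percentage saved is simple arithmetic, and a short sketch makes it concrete. The function names below are illustrative only and do not come from any vendor's product; the 100 TB figure is an assumed example, not a measurement.

```python
def savings_fraction(ratio: float) -> float:
    """Fraction of physical capacity saved for a logical:physical ratio of ratio:1."""
    if ratio < 1:
        raise ValueError("a reduction ratio must be at least 1")
    return 1.0 - 1.0 / ratio


def physical_tb(logical_tb: float, ratio: float) -> float:
    """Physical capacity needed to hold logical_tb of backups at ratio:1."""
    return logical_tb / ratio


if __name__ == "__main__":
    # Assumed example: 100 TB of logical backup data.
    for ratio in (2, 5, 10, 20):
        print(f"{ratio}:1 -> {savings_fraction(ratio):.0%} saved, "
              f"{physical_tb(100, ratio):.0f} TB on disk")
```

Note how the returns diminish: moving from 10:1 to 20:1 only raises the saving from 90% to 95%.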
Over the years, this has led to the development of backup technologies and emerging products to use backup data and hardware in a more cost-effective manner. In this article, we describe how backup targets have evolved and explain what makes the newest backup systems different and more efficient.
We back up applications and data simply because bad things -- including hardware failure, software bugs, data corruption, user errors and, more recently, malicious hacking -- happen. Backups provide a point-in-time copy of the image of an application, allowing us to go back to a previous version of the application and retrieve either part or all of that data.
An important feature here is the idea of taking a copy at a specific point in time. We can also synchronize backups with other processes (like completing batch work or lots of database updates) to allow us to return an application to a point before an event occurred. Most of the time, we restore because that event was bad (for example, our data was corrupted). However, that doesn't always have to be the case.
Server virtualization: The perfect backup storm
Before going further in our look at backup technologies, we should mention how server virtualization led to the implementation of more flexible backups. Before virtualized applications, backup and restore targeted either the physical disks in a server or a copy of the data on shared storage. Restoring an application to a location other than the primary server meant having additional hardware available, usually of a similar or identical configuration. This expensive overhead went away with server virtualization.
Today, we can create a clone of an application by building a copy of the files that make up a virtual server image. Sure, a few tweaks are necessary to make that image accessible to the hypervisor, but with only a little effort, we can create a duplicate version of any application and run it in the same or a different data center.
This capability is already available in many backup products.
The benefits of consolidation
It's clear that server virtualization and disk-based backups have made data protection much easier to implement. However, compared to tape, disk-based storage systems are significantly more expensive to deploy and maintain. And most are fully utilized for only a fraction of any 24-hour period -- specifically, the time when backups are taken (thankfully, we don't need to execute full application restores all that often).
A new range of products is emerging that aims to make better use of backup hardware and the data stored on it. These backup technologies are looking to replace a range of other services that run in the data center -- including copy data management, archiving, backup, analytics and application test/development. We can see why this idea could be both financially and operationally beneficial, as it offers the opportunity to consolidate secondary storage requirements and makes backup data more available and useful than ever.
In the modern data center, critical applications sit on high-performance storage. Test and development systems run off cheaper hardware, perhaps sourced from multiple vendors. And there are other systems for disk-based backup, others again for archive and still more for performing analytics or delivering lower-performance services such as personal and group file shares.
Historically, there have been good reasons for separating primary and secondary data onto separate storage platforms:
- Primary data and backups need to be on separate hardware to mitigate the risk of hardware failure.
- The I/O profiles of applications and other functions like archive and analytics are very different, and so require specific hardware configurations. Primary systems are likely more random in their I/O requirements, whereas archive and analytics are more sequential.
- Getting data out of primary systems and into an archive makes them run faster and reduces the size of the backup window (and also the time to restore).
- Test/development systems have a high turnover rate, with the virtual world providing the ability to pretty much create application images on demand. The volume of data in use can therefore grow and shrink quite significantly, making it cost-effective to place these workloads on cheaper storage where there won't be the same level of application traffic.
From a consolidation perspective, putting all the secondary storage requirements together could save money on hardware. In addition, since the creation of environments for test/development or analytics has been done through operationally intensive processes that utilize existing backups or take copies from the primary systems, simplifying this process would result in substantial operational savings.
The latest in backup appliances
Products with backup capabilities are being brought to market by new companies keen to exploit the opportunity of consolidating secondary storage systems into a single platform.
Founded by Mohit Aron in 2013, Cohesity has thus far received around $70 million in investment. The Cohesity Data Platform is based on a scale-out, node-based architecture that uses a mix of HDD and flash storage. The scale-out nature of the product comes as no surprise, as Aron previously co-founded Nutanix, a hyper-convergence company that also includes a distributed file system.
The Cohesity offering currently provides features that include data protection, test/development environments, file services and analytics. It supports backups for VMware vSphere using the VADP backup API. These backups are essentially point-in-time snapshots of virtual machines (VMs) that can be used to recover VMs or individual files. To recover a VM, it's not necessary to restore data back to the primary storage system. Instead, the backup system acts as a data store and allows for the almost instant recovery of a virtual machine by importing the backup image sitting on the appliance itself. If a recovered VM needs to be kept permanently, it can be moved back to primary storage using features like Storage vMotion.
In addition to restore capabilities, you can harness VM snapshot images for creating test/development environments in a process known as copy data management. VM images can be imported into the test/dev virtual server environment and run from the backup appliance without the need to copy the data out.
Obviously, a traditional backup system wouldn't be designed to operate as both a backup target and a platform for running live virtual machines. To achieve this, Cohesity developed software that allows its platform to retain a very large number of snapshots, each of which can be accessed independently.
Rubrik is another startup changing the process around backup. The company (founded in 2014) offers a scale-out, node-based system that provides both backup and archiving functions.
Rubrik takes a different approach to the way backup requirements are defined. Instead of talking about backup jobs and schedules, data is protected using policies and automation. The administrator simply selects metrics such as snapshot frequency and retention and then leaves the scheduling of tasks to the system.
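To illustrate the difference between policies and job schedules, here is a minimal sketch of how a system might derive its own work from two declared metrics. It is not Rubrik's API or implementation; `ProtectionPolicy`, `snapshot_due` and `expired_snapshots` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ProtectionPolicy:
    """Hypothetical policy: the administrator declares only these two metrics."""
    snapshot_every: timedelta  # desired snapshot frequency
    retain_for: timedelta      # how long each snapshot is kept


def snapshot_due(policy, last_snapshot, now):
    """True if a new snapshot should be taken now (no schedule is defined by hand)."""
    return last_snapshot is None or now - last_snapshot >= policy.snapshot_every


def expired_snapshots(policy, snapshots, now):
    """Return the snapshot timestamps that have aged out of the retention window."""
    return [taken for taken in snapshots if now - taken > policy.retain_for]


if __name__ == "__main__":
    policy = ProtectionPolicy(timedelta(hours=4), timedelta(days=30))
    now = datetime(2016, 6, 6, 12, 0)
    print(snapshot_due(policy, datetime(2016, 6, 6, 7, 0), now))
    print(expired_snapshots(policy, [datetime(2016, 5, 1), datetime(2016, 6, 1)], now))
```

The point of the design is that the administrator states the outcome wanted, and the system decides when to run each task.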
The Rubrik platform is able to directly mount snapshots of VMs for testing or recovery purposes. You can also replicate VM images between geographically dispersed locations, providing additional security and the ability to locate test equipment in a cheaper location. As you would expect, only unique deduplicated data is replicated, making the process of moving a VM backup/image highly efficient.
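The idea of replicating only unique deduplicated data can be sketched in a few lines. This is an assumption-laden illustration rather than any vendor's method: it uses fixed-size 4 KB chunks and SHA-256 fingerprints, a plain dictionary stands in for the remote site, and `split_chunks`, `replicate` and `rebuild` are hypothetical helpers.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real products may chunk differently


def split_chunks(image, chunk_size=CHUNK_SIZE):
    """Return (recipe, chunks): the ordered fingerprints and the unique chunk data."""
    recipe, chunks = [], {}
    for offset in range(0, len(image), chunk_size):
        chunk = image[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        recipe.append(digest)
        chunks[digest] = chunk
    return recipe, chunks


def replicate(image, remote_store):
    """Send only chunks the remote site lacks; return (recipe, bytes_sent)."""
    recipe, chunks = split_chunks(image)
    bytes_sent = 0
    for digest, chunk in chunks.items():
        if digest not in remote_store:
            remote_store[digest] = chunk
            bytes_sent += len(chunk)
    return recipe, bytes_sent


def rebuild(recipe, store):
    """Reassemble an image at the remote site from its recipe."""
    return b"".join(store[digest] for digest in recipe)
```

In this sketch, replicating a second image that shares most of its chunks with an earlier one transfers only the chunks that are new to the remote store.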
Boosting storage capacity with the cloud
As data volumes increase in any platform used for archive or long-term data retention, the percentage of active data tends to decrease. This has a direct impact on the viability of deploying these kinds of backup technologies purely on disk-based hardware. There's a practical limit to the size of a disk-based backup archive, especially when it starts to rival the primary system in terms of space, power and cooling requirements.
Copy data management does data cloning right
Copy data management is a technique for creating clones of application data using features such as snapshots built into primary or secondary storage systems. CDM products keep track of the storage of multiple images, allowing them to be accessed for purposes such as test/development environments.
Both Cohesity and Rubrik now support the archive or migration of data into the public cloud, extending their systems by delivering almost unlimited storage capacity. There's an old adage in the storage industry that capacity is free, but performance costs. This is effectively the model both vendors are following. The Cohesity and Rubrik systems allow capacity to scale into the public cloud (where the customer pays for storage), while the ability to process more data (either as backups or secondary storage) requires the deployment of extra hardware appliances.
Scale capacity and performance as needed
Scale-out backup systems such as ExaGrid or HPE StoreOnce are built and scaled using multiple nodes or appliances. This has the benefit of enabling the streaming of data to many nodes at the same time and scaling both capacity and performance linearly.
Channeling data through the backup product rather than writing directly to the cloud provides multiple benefits:
- The backup appliance retains a full index of the archived content, providing a single search point. Rubrik, for example, emphasizes simplicity by offering Google-like search across the data stored in its system.
- The appliance can be used as a cache, recovering data from the cloud and either keeping a copy locally on the appliance when accessed or moving the data back from the cloud permanently if frequently accessed.
- Future implementations could allow VMs to be restored into the public cloud and run there directly. This extends the model of products that offer data management capabilities rather than acting only as large archives.
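The caching behavior described in the list above can be sketched as a read-through cache with promotion. This is a simplified illustration, not how Cohesity or Rubrik implement it: a dictionary stands in for the cloud object store, and the `ApplianceCache` class and its `promote_after` threshold are assumptions made for the example.

```python
class ApplianceCache:
    """Hypothetical appliance acting as a cache in front of cloud-archived data."""

    def __init__(self, cloud, promote_after=3):
        self.cloud = cloud                  # dict standing in for the cloud store
        self.local = {}                     # copies held on the appliance
        self.reads = {}                     # access count per object
        self.promote_after = promote_after  # assumed "frequently accessed" threshold

    def read(self, key):
        self.reads[key] = self.reads.get(key, 0) + 1
        if key not in self.local:
            # First access: recover from the cloud and keep a local copy.
            self.local[key] = self.cloud[key]
        if self.reads[key] >= self.promote_after and key in self.cloud:
            # Frequently accessed: move the data back from the cloud permanently.
            del self.cloud[key]
        return self.local[key]
```

A real system would also need eviction for cold local copies and would keep its index of archived content separately; both are omitted here for brevity.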
A word about software
Not all of these types of products are based on hardware appliances. Both Actifio and Catalogic, for example, deliver software-based backup products that provide similar functionality to the Cohesity and Rubrik appliances. In the case of Catalogic, it simply harnesses the snapshot capabilities of the underlying storage platform.
The current evolution of backup technologies is one thread in a growing movement to provide more options around data mobility. Server virtualization and, more recently, containers have enabled the application to break away from the constraints of the physical server. Now we need data to have the same level of flexibility, something we are starting to see as backup targets become redefined.