Published: 06 Jun 2018
The container revolution was expected to herald a world of ephemeral container applications deployed and running across both public and private clouds. But the reality of enterprise data management is that application data must persist beyond the lifetime of a single container. As a result, vendors have added data persistence to container infrastructures, and that means container backup is now needed to protect container data like all other enterprise data.
Protecting container data has its own set of requirements and processes. This article explores how container backup is being done and what vendors are setting the standards for backing up containers.
The need for persistence
As containers have grown in popularity over the last four years, it has become clear that completely transient application execution isn't compatible with the enterprise -- at least not in the way IT is usually delivered. Even the 12-factor methodology for application development states that, although applications should be stateless (factor six), persisted data ought to be stored on a backing service.
The most recent implementation of container technology, as standardized by Docker, stores application data on the same host as the container in a folder associated with the container. Lose the container, and you lose the data. Many companies assumed the plan for protecting container data would be to replicate data across many containers on resilient infrastructure. If any single container application failed, the data could be recreated from the remainder.
However, this approach is flawed in several ways. First, it assumes that container applications will always be running. What happens to data that isn't needed for a few days, weeks or months? Second, it doesn't account for the overhead involved in recreating lost data. Imagine a poorly coded 1 TB database application that crashes once a day. That means at least 1 TB of data transfer occurring each day that can affect production performance. It would be quicker and easier to start a new container and point it at existing data rather than replicate data across many containers. Third, because containers are stateless, it is difficult to use technologies such as traditional backup products that need an agent, so you would have to provide container backup and other data protection within the container.
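To put the second point in numbers, the sketch below estimates how long re-replicating a dataset would occupy a network link. The 10 Gbps link speed and the 70% usable-throughput figure are illustrative assumptions, not measurements from any real environment.

```python
def rereplication_seconds(dataset_bytes: float, link_bits_per_sec: float,
                          efficiency: float = 0.7) -> float:
    """Return the seconds needed to copy dataset_bytes over a network link.

    efficiency is an assumed fraction of the raw link speed that the copy
    can actually use (hypothetical default of 70%).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    usable_bits_per_sec = link_bits_per_sec * efficiency
    return dataset_bytes * 8 / usable_bits_per_sec


TB = 10 ** 12             # one decimal terabyte, in bytes
TEN_GBPS = 10 * 10 ** 9   # assumed link speed, in bits per second

seconds = rereplication_seconds(1 * TB, TEN_GBPS)
print(f"Rebuilding 1 TB takes about {seconds / 60:.0f} minutes per crash")
```

Under those assumptions, each crash ties up the link for roughly 19 minutes, and that cost repeats every day the application fails.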
The persistence of persistent data
- Despite initial expectations that containers would be ephemeral and short-lived, actual use has shown that containerized applications must hold data persistently. Persistence has become a standard container feature.
- Enterprises need to know where data lives, who has access and how it is protected. Without persistence, enterprise-level standards of data management couldn't exist within container ecosystems.
- Containers provide multiple ways to store application data. These include mapping directories on a container host, connecting external storage and using volumes provided by container orchestration platforms like Docker.
The container industry and platforms have quickly changed to allow persistent volumes and file shares to be attached to container instances, even under the 12-factor methodology. External or shared storage on traditional storage platforms is starting to be connected to container environments. External storage provides a number of benefits. These include the following:
- Data lifetime isn't tied to the host the container runs on. With shared storage, the host running the containers can effectively become stateless.
- Shared storage can continue to offer the benefits of scalability, resiliency and performance as seen with traditional SAN and NAS platforms.
- You can use a range of media performance types and apply performance levels per container.
- External storage, especially SANs, offers data protection.
- External storage provides greater data mobility, especially with the public cloud.
Enterprises want data secured, protected and audited, and having container data on external storage meets all of these requirements.
Separating data and image
Before proceeding, it's important to clarify what container data means. A running container consists of two parts: the application code, such as a database or web server, and the data in the application. The application code or container image doesn't need to be stored on external or persistent media. Every time a container runs, its runtime environment checks to see if the image is available locally and, if not, downloads it.
This process is one of the key values of containers. Container applications can run anywhere, anytime as long as their images can be downloaded or are already locally stored.
It could be argued that container images require customization. For example, a web server might need specific performance settings or a database might need authorized users added. These changes can be scripted and automated, however, and easily applied to an image as it starts up. There's no reason to save a container image just for the configuration data. So the piece to be concerned with is the application data itself.
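As a minimal sketch of that kind of startup scripting, the function below builds configuration text from environment variables when a container starts, so nothing needs to be saved back into the image. The setting names and default values are hypothetical and chosen only for illustration.

```python
import os
from typing import Mapping, Optional

# Hypothetical setting names and defaults, for illustration only.
DEFAULTS = {"WORKER_PROCESSES": "4", "KEEPALIVE_TIMEOUT": "65"}


def render_config(environ: Optional[Mapping[str, str]] = None) -> str:
    """Build configuration text from environment variables.

    Any setting missing from the environment falls back to DEFAULTS.
    """
    env = os.environ if environ is None else environ
    lines = [
        f"{key.lower()} = {env.get(key, default)}"
        for key, default in sorted(DEFAULTS.items())
    ]
    return "\n".join(lines) + "\n"


# At container start, an entrypoint script could write this text to the
# application's configuration file before launching the application.
print(render_config({"WORKER_PROCESSES": "8"}), end="")
```

Because the configuration is derived at startup, the same unmodified image can be reused everywhere, and only the application data needs protecting.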
Containers aren't VMs
The implementation of persistent data described so far makes containers sound much like virtual machines. A VM on traditional infrastructure, such as VMware, may well have a boot drive and one or more data drives. The VM and the data are usually linked for their lifetime, although it's possible to move a data volume to another VM.
Containers aren't virtual machines, however, and it's important to be aware of that when looking deeper into the container data model. Connecting external storage delivers the benefits already described by keeping data independent of the container application. It shouldn't be used as a way of turning containers into VMs.
Data for containers
Container application data -- not the container itself -- still needs protection. Let's look at how container data is implemented to help clarify how container backup works:
- Within the container. On the Docker platform, data can be stored within the container image's file system. However, Docker uses a union file system to optimize the creation of images, so storing data within the container image is a bad idea. It makes container backup difficult because getting to the data means using the container application itself.
- As a volume. In the Docker ecosystem, a volume is stored on the local host where the container runs; on Linux, it's stored in the folder /var/lib/docker/volumes. The volume is mounted within the container at a specific mount point in the file system. Kubernetes also uses the concept of a volume, which is stored within a pod. A pod is an object that holds one or more containers and lives longer than the lifetime of a single container.
- As a bind mount. Docker allows the mounting of any host directory into a container. This is both good and bad. The host can have a file structure for holding persistent data, but system folders and files can also inadvertently, or deliberately, be mapped to and modified by a container.
- Through a plug-in. Docker and Kubernetes both provide volume plug-ins that map a LUN or volume to a container. This can be a volume or file share that existed before the container was created or one that was created at the same time as the container. Some storage vendors are using plug-ins as a way of providing storage directly to a container, either from software-defined or traditional platforms.
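The difference between a named volume and a bind mount shows up in how each is requested from Docker. The sketch below only builds the argument list for a `docker run` command using the `--mount` flag and does not execute it. The image name, volume name and paths are hypothetical.

```python
from typing import List


def volume_mount(volume_name: str, container_path: str) -> str:
    """Return a --mount value for a named Docker volume."""
    return f"type=volume,source={volume_name},target={container_path}"


def bind_mount(host_path: str, container_path: str,
               read_only: bool = False) -> str:
    """Return a --mount value that maps a host directory into a container."""
    value = f"type=bind,source={host_path},target={container_path}"
    if read_only:
        # A read-only bind mount stops the container modifying host files.
        value += ",readonly"
    return value


def docker_run_args(image: str, mounts: List[str]) -> List[str]:
    """Assemble the argument list for a detached 'docker run' command."""
    args = ["docker", "run", "--detach"]
    for mount in mounts:
        args += ["--mount", mount]
    args.append(image)
    return args


# Hypothetical image, volume name and paths.
args = docker_run_args(
    "example/orders-db:1.0",
    [
        volume_mount("orders-prod-data", "/data"),
        bind_mount("/srv/orders/config", "/config", read_only=True),
    ],
)
print(" ".join(args))
```

Marking a bind mount read-only is one way to limit the risk, noted above, of a container modifying host folders it has been given access to.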
If your data lives within the container file system, then the answer is to use either the application or the container to back up the data. This approach to container backup is feasible for a database platform where there's a standard backup mechanism, but, in general, it's not the most practical approach to running containers.
Volumes, like those implemented by Docker, could be backed up by taking a copy of the data from the host's file system. The same applies to bind mounts on the host. In both cases, the challenge is to match the name of the volume or mount to the application. Docker volumes can be created with arbitrary names, so a good naming standard is essential.
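A host-level copy of this kind can be sketched with the Python standard library alone. The example below archives a directory and puts the volume name and a UTC timestamp in the archive file name so the backup can be matched to its application later. It runs against a throwaway directory standing in for the volume data. On a real Docker host the source would be the volume's directory under /var/lib/docker/volumes, which is an assumption about a default local setup and normally requires root access.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def backup_directory(source_dir: Path, backup_root: Path,
                     volume_name: str) -> Path:
    """Archive source_dir as <backup_root>/<volume_name>-<UTC time>.tar.gz."""
    if not source_dir.is_dir():
        raise FileNotFoundError(f"{source_dir} is not a directory")
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    base_name = backup_root / f"{volume_name}-{stamp}"
    # make_archive appends the .tar.gz suffix and returns the final path.
    archive = shutil.make_archive(str(base_name), "gztar",
                                  root_dir=str(source_dir))
    return Path(archive)


# Demonstration with a temporary directory standing in for a volume.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "orders-prod-data"   # hypothetical volume name
    data_dir.mkdir()
    (data_dir / "table.dat").write_text("row1\n")
    archive_path = backup_directory(data_dir, Path(tmp) / "backups",
                                    "orders-prod-data")
    print(archive_path.name)
```

Copying files from underneath a running application gives no guarantee that the data is consistent, so a copy like this is safest when the container is stopped or the application has been quiesced.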
The problem with Docker volumes comes when you must restore data. If a volume still exists on the host, then it's fine to put data back into that directory with a standard restore. If the volume doesn't exist, then it would need to be recreated before data is reinserted. This is less of a problem for bind mounts, as the directory name is outside the Docker structure and can be backed up and restored at any time.
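The recreate-then-restore sequence can be sketched as follows, using a plain directory to stand in for the volume. For a real Docker named volume, the recreation step would be `docker volume create` with the volume's name rather than a directory creation. The paths and names here are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def restore_directory(archive: Path, target_dir: Path) -> bool:
    """Unpack archive into target_dir, creating the directory if missing.

    Returns True when target_dir had to be recreated before the restore.
    """
    recreated = not target_dir.exists()
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(target_dir))
    return recreated


# Demonstration: back up a directory, lose it, then restore it.
with tempfile.TemporaryDirectory() as tmp:
    volume_dir = Path(tmp) / "orders-prod-data"   # hypothetical volume name
    volume_dir.mkdir()
    (volume_dir / "table.dat").write_text("row1\n")
    archive = Path(shutil.make_archive(str(Path(tmp) / "backup"), "gztar",
                                       root_dir=str(volume_dir)))

    shutil.rmtree(volume_dir)                     # simulate losing the volume
    was_recreated = restore_directory(archive, volume_dir)
    restored_text = (volume_dir / "table.dat").read_text()
    print(was_recreated, restored_text.strip())
```

Returning whether the target had to be recreated lets a restore job record which volumes were rebuilt from scratch and which were restored in place.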
Plug-ins allow data on external storage to be mapped to a container. The Container Storage Interface, an open source initiative started by the leading container orchestration companies, is aiming to develop a consistent approach to this mapping process. The plug-in has the responsibility of performing any external work needed to map storage to the container host and then onto the container itself.
Many plug-in implementations focus on attaching a LUN or volume from external storage onto a container. As such, the volume could also be mounted into the host and backed up using traditional backup software. Alternatively, the volume could be protected using a snapshot on the storage array and remounted elsewhere, from where it can be backed up.
Mounting snapshots to a backup server is an age-old technique for protecting data, but it means the backup will look like a crash-consistent copy when restored if the container app was running at the time. Again, naming LUNs or volumes appropriately will aid in being able to restore the data in the future. Plug-in definitions allow mapping existing volumes to a new container. If a LUN can be restored from a backup or snapshot, it's easy to present it back to the host.
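One way to make names useful at restore time is to adopt a convention that can be parsed mechanically. The sketch below assumes a hypothetical `<application>-<environment>-<purpose>` convention. Neither Docker nor any storage vendor mandates this format, and it is shown only to illustrate the idea.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical convention: <application>-<environment>-<purpose>,
# for example "orders-prod-data".
NAME_PATTERN = re.compile(
    r"^(?P<app>[a-z0-9]+)-(?P<env>dev|test|prod)-(?P<purpose>[a-z0-9]+)$"
)


class VolumeName(NamedTuple):
    app: str
    env: str
    purpose: str


def parse_volume_name(name: str) -> Optional[VolumeName]:
    """Split a volume or LUN name into its parts, or return None if the
    name does not follow the assumed convention."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return VolumeName(**match.groupdict())


for candidate in ["orders-prod-data", "a8f3c2d1"]:
    print(candidate, "->", parse_volume_name(candidate))
```

A backup job could use a check like this to flag volumes whose names can't be traced back to an application before they are ever needed for a restore.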
Container backup implementations
Here are some vendors' approaches to container backup available today. These include products, agents and plug-ins that provide access to container data.
Asigra has been doing Docker backup for years. Its cloud backup offering understands and integrates with the Docker file structure and includes a client component, DS-Client, which runs as a Docker container.
Blockbridge Networks has a Docker Volume Driver, now at release 4.0, that backs up and restores container data. This includes pushing the backup data to AWS S3 or another compatible provider.
Commvault Version 11 and onward support backing up both container images and container data. This is achieved by adding a virtualization client to each Docker host running containers.
NetApp Docker Volume Plugin, or nDVP, works with its ONTAP, SolidFire Element OS and E-Series platforms. It lets you snapshot, clone and access volumes created from external systems that support a standard file system format.
Nimble Storage, part of Hewlett Packard Enterprise, has a Docker plug-in that provides import, clone and snapshot features.
Pure Storage has plug-ins for Docker and Mesos that offer support for FlashBlade and FlashArray.
Obviously, if container data is provided through directories on a container host, it's a simple matter to back up that data using a traditional backup platform. This means any existing backup product could be used for backing up container data. The caveat: They won't integrate with the container orchestration platform.
Container data needs protection like all enterprise data. The specifics of the data protection and container backup implementation depend on how the container ecosystem has been designed. But one thing is clear: Because the backup process moved away from the client and guest in the migration to VMs, containers won't directly provide the backup data to traditional backup applications. Instead, backup can be performed on the underlying orchestration platform, whether at the host or the storage layer.