When a virtual machine restoration fails, it is important to troubleshoot the problem quickly. The event logs are a good starting point, but the information they contain can be cryptic and might not point directly to the cause. Here are ten common reasons why a virtual machine restoration might fail.
1. The backup was corrupt
One of the most common causes of restoration failures is a corrupt backup. Backup corruption can result from media failures, communication failures or any number of other causes. The best way to protect your organization against backup corruption is to test your backups periodically and correct any problems that you find.
Backup corruption can be difficult to troubleshoot. In these situations, the backup logs are your best source of information. If the logs contain read errors, media problems might be to blame. Similarly, if the logs reflect a communications failure, you might have a bad cable or a bad I/O card.
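One lightweight way to catch corruption that creeps in after a backup is written (for example, from failing media) is to record a checksum for every backup file when the job finishes and re-check those checksums on a schedule. The Python sketch below assumes the backup lands as ordinary files in a directory; build_manifest and verify_manifest are hypothetical helpers, not part of any backup product. A matching checksum only proves the files have not changed since they were written. It does not prove the backup was consistent in the first place, so periodic test restores are still needed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large backup files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir: Path) -> dict[str, str]:
    """Record a checksum for every file right after the backup completes."""
    return {
        str(p.relative_to(backup_dir)): sha256_of(p)
        for p in sorted(backup_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(backup_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Re-hash the backup later and report missing or changed files."""
    problems = []
    for rel_path, expected in manifest.items():
        p = backup_dir / rel_path
        if not p.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(p) != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

The manifest should be stored somewhere other than the backup media itself, so that a media failure cannot take out both the data and the record used to check it.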
2. The VSS is not running
Backups of Windows servers are generally based on the Volume Shadow Copy Service (VSS). The backup application typically acts as a VSS requester. The Volume Shadow Copy Service, an operating system-level service, coordinates the requester with the VSS writers, which prepare application data so that it is in a consistent state, and with the VSS provider, which creates the shadow copy itself. This service must be running (or able to start when a backup or restore requests it) for VSS to function correctly, and any required writers must also be in a healthy state.
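You can check the state of the Volume Shadow Copy Service by using the Service Control Manager, for example through the Services console or by running sc query vss. You can check the state of the individual VSS writers by running the vssadmin list writers command from an elevated command prompt.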
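If you collect vssadmin list writers output from many hosts, a small parser can flag the writers that are not healthy. This Python sketch assumes English-language output in the usual layout, where each "Writer name:" line is followed by indented "State:" and "Last error:" lines; list_writers_text, parse_writers and unhealthy are hypothetical helper names. Running vssadmin itself requires Windows and an elevated prompt.

```python
import subprocess


def list_writers_text() -> str:
    """Run 'vssadmin list writers' (Windows only, elevated prompt required)."""
    result = subprocess.run(
        ["vssadmin", "list", "writers"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_writers(text: str) -> list[dict[str, str]]:
    """Turn vssadmin output into one dict per writer (name, state, last_error)."""
    writers: list[dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Writer name:"):
            # "Writer name: 'System Writer'" -> "System Writer"
            current = {"name": line.split(":", 1)[1].strip().strip("'")}
            writers.append(current)
        elif current is not None and line.startswith("State:"):
            # "State: [1] Stable" -> "Stable"
            current["state"] = line.split("]", 1)[-1].strip()
        elif current is not None and line.startswith("Last error:"):
            # "Last error: No error" -> "No error"
            current["last_error"] = line.split(":", 1)[1].strip()
    return writers


def unhealthy(writers: list[dict[str, str]]) -> list[dict[str, str]]:
    """Writers that are not in the Stable state or that report an error."""
    return [
        w for w in writers
        if w.get("state") != "Stable" or w.get("last_error") != "No error"
    ]
```

On a Windows host, something like unhealthy(parse_writers(list_writers_text())) returns only the writers worth investigating. Because vssadmin output is localized, the string matching would need adjusting on non-English systems.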
3. The backup was taken at the file level
Most modern backup applications fully support virtual machine backups. However, these same backup applications generally also provide file-level backup capabilities. If a backup application is configured to back up a host server at the file level (as opposed to performing a virtual machine-specific backup), there is a good chance that the restore will fail, because file-level backups of running virtual machines are almost guaranteed to be in an inconsistent state.
4. Critical VM components were omitted from the backup
Virtual machines consist of more than just virtual hard disks. Every virtual machine also has critical metadata and configuration data that identify the virtual machine and define its hardware resource allocations.
In some organizations, it is common practice to store virtual machine configuration data and snapshot data separately from the virtual hard disks. In those situations, the backup application must be aware of the decentralized nature of the various virtual machine components. Otherwise, some virtual machine components might not get backed up, which would make a virtual machine-level restoration impossible. The quick and dirty way to find out whether this happened is to check the backup logs. The best way is to verify the integrity of your backups through testing.
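When configuration files, snapshot files and virtual hard disks live on different volumes, it helps to compare the list of files each virtual machine actually uses against what the backup catalog contains. In this Python sketch, both inputs are assumed to come from elsewhere (for example, an inventory export from the hypervisor and a file listing from the backup application); missing_components and the example paths are hypothetical.

```python
from pathlib import PureWindowsPath


def _norm(path: str) -> str:
    # Windows paths are case-insensitive; PureWindowsPath also turns "/" into "\".
    return str(PureWindowsPath(path)).lower()


def missing_components(vm_files: dict[str, list[str]],
                       backed_up: set[str]) -> dict[str, list[str]]:
    """Map each component type to the files the backup catalog does not contain."""
    catalog = {_norm(p) for p in backed_up}
    report: dict[str, list[str]] = {}
    for kind, paths in vm_files.items():
        gaps = [p for p in paths if _norm(p) not in catalog]
        if gaps:
            report[kind] = gaps
    return report


# Hypothetical layout: configuration, disks and snapshots on separate volumes.
vm = {
    "configuration": [r"C:\VMConfig\web01\web01.vmcx"],
    "virtual disks": [r"D:\VHDs\web01.vhdx"],
    "snapshots": [r"E:\Snapshots\web01\web01_snap.avhdx"],
}
catalog = {r"d:\vhds\WEB01.vhdx", "C:/VMConfig/web01/web01.vmcx"}
print(missing_components(vm, catalog))
```

Here the snapshot volume was never included in the backup job, so the report lists only the snapshot file.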
5. Storage quotas are being exceeded
Storage quotas are another common cause of virtual machine restoration failures. This can be particularly problematic in private cloud environments in which each tenant is allocated a limited amount of storage. If the tenant's remaining allocation is smaller than the data being restored, the restoration might be impossible unless the quota is temporarily raised.
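The arithmetic is simple, but it is worth doing before starting a large restore rather than discovering the shortfall partway through. This sketch assumes you can obtain the tenant's quota, its current usage and the size of the data to be restored, all in bytes; the quota_shortfall function and the figures are illustrative.

```python
GIB = 1024 ** 3


def quota_shortfall(quota_bytes: int, used_bytes: int, restore_bytes: int) -> int:
    """Bytes by which a restore would exceed the tenant's quota (0 if it fits)."""
    remaining = max(quota_bytes - used_bytes, 0)
    return max(restore_bytes - remaining, 0)


# A 500 GiB quota with 420 GiB already used leaves 80 GiB, so restoring
# a 120 GiB virtual machine would require raising the quota by 40 GiB.
print(quota_shortfall(500 * GIB, 420 * GIB, 120 * GIB) // GIB)
```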
6. The host runs low on memory
Restoring a virtual machine consumes resources on the host server while the restoration is taking place. Generally, the restore operation consumes disk and network I/O, CPU cycles and memory.
The problem is that some organizations attempt to achieve the highest possible virtual machine density on each host in order to maximize the return on their hardware investment. If the host server is already low on resources, a restore operation might fail as a result. This is especially likely when there is not enough memory available.
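A pre-restore check can compare each candidate host's free memory with what the restore job and the restored virtual machine will need once it starts. The Python sketch below uses hypothetical helpers: the inputs have to come from your own monitoring, and the 2,048 MB host reserve is an arbitrary placeholder, not a vendor recommendation.

```python
def memory_headroom_mb(host_free_mb: int, vm_startup_mb: int,
                       restore_job_mb: int, host_reserve_mb: int = 2048) -> int:
    """Memory left after the restore job, the restored VM and a host reserve.

    A negative result means the host is likely to run short during or
    right after the restore. host_reserve_mb is an arbitrary placeholder.
    """
    return host_free_mb - host_reserve_mb - restore_job_mb - vm_startup_mb


def hosts_with_room(hosts: dict[str, int], vm_startup_mb: int,
                    restore_job_mb: int, host_reserve_mb: int = 2048) -> list[str]:
    """Hosts (name -> free MB) that can absorb the restore, most headroom first."""
    headroom = {
        name: memory_headroom_mb(free, vm_startup_mb, restore_job_mb, host_reserve_mb)
        for name, free in hosts.items()
    }
    fits = [name for name, h in headroom.items() if h >= 0]
    return sorted(fits, key=lambda name: headroom[name], reverse=True)
```

For example, restoring a VM that needs 8,192 MB at startup with a 4,096 MB restore job rules out a host with only 10,240 MB free, even though that host looks like it has room for the VM alone.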
7. A virtual backup appliance live migrates
Some organizations configure their server virtualization infrastructure to dynamically shift workloads among the available virtualization hosts in response to demand. This can sometimes be a problem if the backup application is running on a virtual appliance.
In theory, a virtual appliance should be able to live migrate to another host while a backup or restore operation is running without causing any problems. Sometimes, however, the migration process can cause a momentary loss of connectivity, causing currently running jobs to fail.
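Where the virtualization platform allows it, the simplest fix is to keep the appliance from migrating during backup and restore windows. Otherwise, one mitigation is to have whatever drives the job retry a step after a brief connectivity loss. This Python sketch retries on ConnectionError with exponential backoff; run_with_retries is a hypothetical helper, and retrying is only safe for steps that can be repeated without side effects or that resume where they left off.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(step: Callable[[], T], attempts: int = 4,
                     base_delay: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call step(); on ConnectionError wait 2 s, 4 s, 8 s, ... and try again.

    The final failure is re-raised so the job still reports it.
    The sleep parameter exists so tests can avoid real waiting.
    """
    for attempt in range(attempts):
        try:
            return step()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With the defaults, a pause of up to about 14 seconds (2 + 4 + 8) is tolerated before the job gives up.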
8. The application is protected against a restoration
Another reason a restore operation might fail is that some applications protect themselves against being overwritten by a restore. Exchange Server is a classic example of such an application. If you attempt to restore a mailbox database, the restore operation will fail unless you specifically give Exchange Server permission to overwrite the database. Although this type of protective mechanism should not prevent a virtual machine-level restoration, it can get in the way of restoring individual applications or databases within a virtual machine. This can happen in physical, virtual or mixed environments.
9. The restoration conflicts with a running VM
A virtual machine restoration can sometimes fail if the virtual machine that is being restored somehow conflicts with a running virtual machine. The degree to which various types of conflicts can be handled varies from one backup application to another, but some of the conflicts that might cause a restoration failure include duplicate virtual machine names, duplicate virtual MAC addresses or duplicate operating system-level identifiers.
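Before restoring, it can help to compare the identity of the virtual machine being restored with the machines already running. This Python sketch checks names, virtual MAC addresses and a single unique ID per VM; the VmIdentity class, its vm_id field and find_conflicts are hypothetical, and operating system-level identifiers could be added as further fields in the same way. Whether a given duplicate actually blocks the restore depends on the backup application and hypervisor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmIdentity:
    name: str
    mac_addresses: frozenset[str]
    vm_id: str  # hypothetical: whatever unique ID your hypervisor assigns


def _mac(mac: str) -> str:
    # Normalize "00-15-5D-01-02-03" and "00:15:5d:01:02:03" to "00155d010203".
    return mac.replace("-", "").replace(":", "").lower()


def find_conflicts(restoring: VmIdentity, running: list[VmIdentity]) -> list[str]:
    """List every way the VM being restored collides with a running VM."""
    conflicts = []
    restoring_macs = {_mac(m) for m in restoring.mac_addresses}
    for vm in running:
        if vm.name.lower() == restoring.name.lower():
            conflicts.append(f"name '{vm.name}' already in use")
        if vm.vm_id.lower() == restoring.vm_id.lower():
            conflicts.append(f"VM ID {vm.vm_id} already in use by '{vm.name}'")
        for mac in sorted(restoring_macs & {_mac(m) for m in vm.mac_addresses}):
            conflicts.append(f"MAC {mac} already in use by '{vm.name}'")
    return conflicts
```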
10. The underlying cause of the problem has not been resolved
Suppose for a moment that a VM becomes corrupt and you decide to restore that VM from backup. If the restoration fails, it could be because the underlying cause of the corruption has not been addressed. For example, if the VM originally became corrupted as a result of a disk volume problem, but you did not take the time to fix the volume before attempting the restoration, the restoration could fail because of the volume's state.
As you can see, there are a number of different factors that can cause a restoration failure. If a restoration fails, it is a good idea to check the backup application's event logs for clues as to the cause of the problem.
Generally speaking, security-related log entries point to a password problem with the service account or a lack of sufficient permissions for either the backup operator or the service account. An agent failure or a more generalized failure message can be caused by problems with the Volume Shadow Copy Service. Likewise, communications failures typically point to hardware problems, while a read failure might indicate a bad tape or a dirty tape drive.
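The rules of thumb above can be turned into a rough first-pass triage of log messages. In this Python sketch the keyword lists are hypothetical and would need tuning, because the actual wording of log entries varies from one backup application to another; rules are checked in order and the first match wins.

```python
# Hypothetical keyword rules; real log wording varies by backup application.
RULES = [
    (("access denied", "logon failure", "password", "permission"),
     "security: check the service account password and the permissions "
     "of the backup operator and service account"),
    (("vss", "shadow copy", "agent"),
     "agent/general: check the Volume Shadow Copy Service and its writers"),
    (("communication", "connection", "timeout", "network"),
     "communications: check cables, I/O cards and other hardware"),
    (("read error", "crc", "media"),
     "read failure: suspect a bad tape or a dirty tape drive"),
]


def triage(message: str) -> str:
    """Return the first matching hint for a log message (case-insensitive)."""
    text = message.lower()
    for keywords, hint in RULES:
        if any(k in text for k in keywords):
            return hint
    return "unclassified: read the full log entry"
```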
About the author:
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server. Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the department of information management at Fort Knox. You can visit Brien's personal website at www.brienposey.com.
This was first published in March 2014