Previous tips in this series introduced the idea that implementing availability requires a layered approach. Following that approach, we looked at good system administration practices, backups, disks and storage (and why larger disks aren't always better), networking, and the system's local environment. The seventh level in the Availability Index (introduced in part one) addresses applications and services. The service we're going to look at today is the Network File System, or NFS. A widely adopted protocol for sharing files from servers to clients, NFS is available in one form or another on most of the major platforms found in today's enterprise.
The issue I want to discuss this month is the use of soft mounts with NFS. Many system administrators believe that soft NFS mounts are their friend. In fact, soft NFS mounts should be avoided at pretty much all costs.
When an NFS client mounts an NFS file system from a server, one of the mount-time options is whether the mount should be soft or hard. The difference only comes into play when the server that is providing the mount stops responding. With a soft mount, after a settable timeout (usually lasting several minutes) has passed, any operations that were trying to read from or write to the file system will give up and fail. A hard mount never times out, and its disk operations are not interruptible; a control-C will not cause the operation to stop trying. Instead, on a hard mount, the operation will continue until it completes, or until the client attempting the operation is shut down.
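To see which choice a client has actually made, the mount options can be inspected. The following is a minimal sketch, assuming a Linux client whose mount table follows the /proc/mounts layout (device, mount point, type, options); the function name and the sample server names are hypothetical and used only for illustration.

```python
def find_soft_nfs_mounts(mounts_text):
    """Return mount points of NFS file systems mounted with the 'soft' option.

    Assumes mounts_text uses the Linux /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    soft_mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        options = fields[3].split(",")
        if fs_type in ("nfs", "nfs4") and "soft" in options:
            soft_mounts.append(mount_point)
    return soft_mounts


# Hypothetical mount table, made up for this example.
SAMPLE = """\
server1:/export/home /home nfs rw,hard,timeo=600,retrans=2 0 0
server2:/export/scratch /scratch nfs rw,soft,timeo=100,retrans=3 0 0
/dev/sda1 / ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    print(find_soft_nfs_mounts(SAMPLE))
```

On a real Linux client, the same function could be fed the contents of /proc/mounts instead of the sample text. Other platforms report mount options differently, so the parsing would need to be adapted.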
The advantage, then, appears to go to soft mounts, since the client system cannot hang as a result of a failure on its NFS server. That is, unfortunately, a shortsighted and incorrect point of view.
The truth is that if an NFS write can time out without completing, the result can be data corruption. Consider the following sequence of events:
Write 1: succeeds
Write 2: succeeds
NFS server crashes
Write 3: fails, due to timeout
Write 4: fails, due to timeout
NFS server recovers (very quickly)
Write 5: succeeds
Write 6: succeeds
When a user later attempts to read the file, the read proceeds normally until it reaches the positions of the failed writes. Those positions in the file will have no data in them (or, more accurately, garbage data). They are, effectively, holes. When the application reads them and attempts to act on the data, the application is likely to fail, and depending on the nature of the application, it could take the whole system with it. What's more, the data that was supposed to be in the holes is lost; unless other steps were taken, it cannot be retrieved.
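The sequence of six writes can be replayed in a small simulation. This is only a sketch under simplifying assumptions: the file is modeled as an in-memory byte array, every write is a fixed-size record, and a lost write is modeled as a run of zero bytes (on a real file system the hole may hold other garbage). The names RECORD_SIZE, replay_writes, and find_holes are hypothetical.

```python
RECORD_SIZE = 4  # hypothetical fixed record size, in bytes

# True = the write reached the server, False = it timed out on a soft mount.
ARTICLE_SCENARIO = [True, True, False, False, True, True]


def replay_writes(outcomes):
    """Simulate sequential fixed-size writes whose errors are ignored.

    The application moves on to the next record whether or not the previous
    write succeeded, which mirrors code that never checks write results.
    """
    contents = bytearray()
    for number, succeeded in enumerate(outcomes, start=1):
        offset = (number - 1) * RECORD_SIZE
        # Grow the simulated file with zero bytes up to the end of this record.
        missing = offset + RECORD_SIZE - len(contents)
        if missing > 0:
            contents.extend(b"\x00" * missing)
        if succeeded:
            # Fill the record with its own write number, e.g. b"1111".
            contents[offset:offset + RECORD_SIZE] = str(number).encode() * RECORD_SIZE
        # A failed write leaves the zero fill in place: a hole.
    return bytes(contents)


def find_holes(contents):
    """Return the 1-based numbers of records that contain only zero bytes."""
    holes = []
    for offset in range(0, len(contents), RECORD_SIZE):
        if contents[offset:offset + RECORD_SIZE] == b"\x00" * RECORD_SIZE:
            holes.append(offset // RECORD_SIZE + 1)
    return holes


if __name__ == "__main__":
    data = replay_writes(ARTICLE_SCENARIO)
    print(data)
    print("records lost:", find_holes(data))
```

The point of the model is that the file ends up the expected length and the last two records look fine, so nothing about the file itself signals that records 3 and 4 are missing.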
The write errors that come from failed soft-mounted NFS file systems can be detected, but most developers do not write their code to check error codes from every single write.
Hard mounts may appear to be inconvenient because failures and timeouts cannot be interrupted. That lack of interruption is exactly what you want; it ensures that data gets written when and where you expect it to be written, and that failures and write errors get caught. Write errors cannot be glossed over and left for production applications to discover later on.
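As an illustration of what checking every write involves, here is a minimal sketch in Python. It assumes a POSIX-style client, where an NFS write error may be reported not by the write call itself but later, when buffered data is flushed by fsync or close; the function name checked_write is hypothetical.

```python
import os
import tempfile


def checked_write(path, data):
    """Write data to path, checking every step that can report an error.

    Returns None on success, or the first OSError encountered.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as err:
        return err

    error = None
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than requested
            view = view[written:]
        os.fsync(fd)  # push buffered data out; deferred errors can surface here
    except OSError as err:
        error = err

    try:
        os.close(fd)  # close can also report a deferred write error
    except OSError as err:
        if error is None:
            error = err
    return error


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "records.dat")
        problem = checked_write(target, b"record 1\nrecord 2\n")
        print("write failed:" if problem else "write succeeded", problem or "")
```

Even with this much care, the caller still has to decide what to do when an error comes back (retry, alert, or abort), which is part of why so much code skips the checks altogether.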
Copyright 2003, Evan Marcus
Evan L. Marcus is Data Availability Maven at VERITAS Software. Contact him at email@example.com.