If you've done anything interesting with highly available systems (including but not limited to systems built on Heartbeat and/or Pacemaker), you will have encountered the need to fence misbehaving or otherwise broken nodes. One approach is to kill the (allegedly misbehaving) node, i.e. Shoot The Other Node In The Head, or STONITH for short. In a two-node HA cluster, you end up with hardware that looks something like this:
Unfortunately, it's possible to wind up in a situation where each node believes the other to be broken; the first node shoots the second, then when the second reboots, it shoots the first, and so on, ad infinitum, until you realise that perhaps a single non-HA node would have been both cheaper and more reliable. This can aptly be referred to as a state of STONITH deathmatch.
The remainder of this document focuses specifically on Heartbeat/Pacemaker HA clusters; while similar principles may apply to other software stacks, the specifics will likely vary.
In the case of Heartbeat/Pacemaker HA clusters, there are basically three reasons for one node to STONITH the other:
The first cause can – and should – be mitigated by ensuring redundant communication paths exist between all nodes in the cluster, and that your network switch(es) handle multicast properly.
The second cause is fairly obvious, and unlikely to be the cause of STONITH deathmatch; nothing here will make the soon-to-be-dead node think its partner is also in need of killing.
The third is perhaps not so straightforward. The specifics of the following may vary depending on your configuration, but roughly, here's how the game is played:
Given this chain of events, it is critically important that creators of resource agents (i.e. the scripts that start, stop and monitor HA resources) ensure that stop operations always succeed, unless the resource cannot actually be stopped. Here's a contrived example of how not to do this, for an HA filesystem resource:
#
# This is a contrived example. Do not do this in real life.
#
start()
{
    if mount $DEVICE $MOUNTPOINT
    then
        return $OCF_SUCCESS
    else
        return $OCF_ERR_GENERIC
    fi
}

stop()
{
    if umount $MOUNTPOINT
    then
        return $OCF_SUCCESS
    else
        # This is broken!
        return $OCF_ERR_GENERIC
    fi
}
The contrived example is broken because:
Simple: Don't try to unmount the filesystem if it's not already mounted. Optimally, fish around in /proc/mounts if it's available. If not, try checking the output of mount (run with no arguments, so that it only lists what is mounted) for $MOUNTPOINT.
More generically, the goal here is to find the cheapest, most efficient means of checking whether an HA resource is already stopped, then return success if it is already stopped. Only if it's not already stopped should your resource agent attempt to actually stop it, and thus possibly result in a failure and subsequent STONITH.
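As a rough illustration of that idea, here is a minimal sketch of a stop operation that looks in /proc/mounts before touching umount. It is not taken from any real resource agent: the is_mounted helper and the MOUNTS_FILE variable are names invented for this example, and the OCF return codes are set by hand so the snippet stands alone.

```sh
# Standard OCF return codes, defined here so the sketch is self-contained
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# MOUNTS_FILE is a hypothetical override, useful for trying this out
# against a test file; it defaults to the real /proc/mounts.
: "${MOUNTS_FILE:=/proc/mounts}"

# Succeeds only if $1 is listed as a mount point (the second field) in
# $MOUNTS_FILE. The whole field is compared, so /mnt/data does not
# match /mnt/data2.
is_mounted()
{
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "$MOUNTS_FILE"
}

stop()
{
    # Only trust the "not mounted" answer if the mounts file is readable;
    # otherwise fall through and let umount decide.
    if [ -r "$MOUNTS_FILE" ] && ! is_mounted "$MOUNTPOINT"
    then
        # Already stopped: nothing to do, and nothing that can fail
        return $OCF_SUCCESS
    fi

    if umount "$MOUNTPOINT"
    then
        return $OCF_SUCCESS
    else
        return $OCF_ERR_GENERIC
    fi
}
```

Note that /proc/mounts escapes spaces in mount points (as \040), which this sketch does not handle.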
It is also important to avoid metafailures; for example, a simple syntax error in the stop script can ultimately result in what appears to be a failure, causing a STONITH for entirely the wrong reason.
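One cheap guard against that kind of metafailure is to have the shell parse the script without running it before the script goes anywhere near a cluster. The sketch below assumes the resource agent is a POSIX sh script (use bash -n instead if it is written for bash); check_agent_syntax is a name made up for this example.

```sh
# check_agent_syntax is a hypothetical helper. "sh -n" reads the script
# and reports syntax errors without executing any of its commands.
check_agent_syntax()
{
    sh -n "$1"
}
```

It returns non-zero if the script does not parse, so it can gate a deployment step. It only catches syntax errors, not logic errors, so it complements testing rather than replacing it.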
Several things, but really it boils down to these two:
In Heartbeat/Pacemaker clusters, all operations (start/monitor/stop) have a timeout. If this timeout elapses prior to the completion of the operation, the operation is considered failed.
So, let's try another contrived example: Assume you have a highly available filesystem resource, and your stop timeout is set to 30 seconds. Now imagine you're about to stop the filesystem, but you've got a whole lot of dirty data that's going to be flushed as part of the unmount. In this example, the unmount is going to take longer than 30 seconds.
1, 2, 3, ... 29, 30, *BANG* You're dead. For no good reason. And half your dirty data wasn't flushed to disk. Whoever owns that data is not going to be happy, and what's worse, the stop probably would have succeeded if we hadn't hit that timeout.
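The effect can be mimicked outside a cluster. The sketch below is only an illustration, not how Pacemaker itself enforces operation timeouts: it uses the GNU coreutils timeout command, and run_stop_with_deadline is a hypothetical helper invented for this example.

```sh
# Standard OCF return codes, defined here so the sketch is self-contained
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# run_stop_with_deadline SECONDS COMMAND [ARGS...]
# Runs COMMAND, but gives up once SECONDS have elapsed.
run_stop_with_deadline()
{
    deadline=$1
    shift

    timeout "$deadline" "$@"
    rc=$?

    if [ "$rc" -eq 0 ]
    then
        return $OCF_SUCCESS
    fi

    # GNU timeout exits with status 124 when the deadline expired
    if [ "$rc" -eq 124 ]
    then
        echo "stop timed out after ${deadline}s" >&2
    fi
    return $OCF_ERR_GENERIC
}
```

With a deadline of 30, a command that would have finished cleanly in 40 seconds is reported as failed just the same as one that could never have finished at all.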
One solution is to ensure that your stop timeouts always exceed the duration of the longest possible successful stop. One issue with this solution is that it will increase your "best worst case failover time". Depending on your application this may or may not be a problem; either way, you need to do the numbers – see below for details.
It's difficult to ensure this section is exhaustive; I certainly can't write about the things I didn't think of. That aside, here's yet another contrived example that illustrates a potentially non-obvious problem:
#
# This is yet another contrived example. Do not do this in real life.
#
stop()
{
    # This is broken for several reasons, but they might not be obvious
    if df | grep -q $MOUNTPOINT
    then
        if umount $MOUNTPOINT
        then
            return $OCF_SUCCESS
        else
            return $OCF_ERR_GENERIC
        fi
    else
        # filesystem is not in df output, thus was not mounted,
        # thus stop is successful
        return $OCF_SUCCESS
    fi
}
The intent here is good: check if the filesystem is mounted; if it's not already mounted, return success. If it is mounted, unmount it and return success if the unmount succeeds. Most of the time it will actually work.
The problem (aside from that grep being way too loose, since it will also match any other mountpoint that merely contains the same string) is that df examines all mounted devices. It can block for a while if some mounted filesystem is under heavy load. It can block forever if some mounted filesystem has disappeared completely (e.g. a remote NFS mount). And then you're back in timeout land, and then you're dead.
Even if you only run df on the mountpoint you care about, it's still going to hit the disks; if the filesystem you're looking at is under load, that df might take a while, which is why you're better off looking in /proc/mounts... But that's not the point.
The point is: you are trying to figure out if you can return success for a stop operation without relying on any other system state outside the resource you are trying to stop. If you can return success, great! If you can't, you then need to stop that resource without relying on any components of the system other than those absolutely necessary to effect the stop.
The "best worst case failover time" is the least amount of time it will take, under the most adverse conditions, for a highly available resource to fail on one node, restart on another node, and become accessible to client systems again.
Put another way, it's the maximum potential downtime you need to mention in the fine print of any sales contract.
Bearing in mind the flexibility available with resource constraint scores, we need a fourth contrived example. Imagine a two-node HA cluster, running a single HA resource, configured such that a failure of that resource on one node will trigger a migration to the other node, and vice versa, until eventually the resource becomes unrunnable. Assume that we have timeouts and intervals set as follows: a monitor interval of 20 seconds, a monitor timeout of 30 seconds, a stop timeout of 60 seconds and a start timeout of 20 seconds.
Here's the sequence of events involved in a worst case failover:
Add all the numbers together, and the best worst case failover time is:
20 (monitor interval) + 30 (monitor timeout) + 60 (stop timeout) + 20 (start timeout) = 130 seconds
An average failover might be 10 seconds of waiting for the next monitor, plus a second each for the monitor to report the failure, the stop to succeed and the start to succeed, giving a failover time of 13 seconds to brag about. And that's really good. But bragging rights don't help if the system breaks in a way you didn't think of, and your client is not aware that the best worst case failover time is an order of magnitude larger than what you quoted as a happy general case.
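The two sums above are easy to script, which makes it painless to redo them whenever a timeout or interval changes. This is just a sketch of the arithmetic; failover_time is a hypothetical helper, and all values are in seconds.

```sh
# failover_time WAIT_FOR_MONITOR MONITOR STOP START
# Adds up the four phases of a failover, all given in seconds.
failover_time()
{
    echo $(( $1 + $2 + $3 + $4 ))
}

# Best worst case: full monitor interval, then every operation
# runs right up to its timeout.
failover_time 20 30 60 20   # prints 130

# Average case from the text: half a monitor interval, then one
# second each for monitor, stop and start.
failover_time 10 1 1 1      # prints 13
```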
Look at the resource constraints, scores, intervals and timeouts. Do the numbers. Become afraid. It's worth it.
Tricky, but not impossible. Life will be easier if you can: