Heartbeat, failover and quorum in a Windows, Linux or AIX cluster

How heartbeats and failover work in a cluster on Windows, Linux and AIX

The basic mechanism for synchronizing two servers and detecting server failures is the heartbeat, which is a monitoring data flow on a network shared by a pair of servers.

The SafeKit software supports as many heartbeats as there are networks shared by two servers. The heartbeat mechanism is used to implement Windows, Linux and AIX clusters. It is integrated within the SafeKit mirror cluster with real-time file replication and failover.

In normal operation, the two servers exchange their states (PRIM, SECOND, the resource states) through the heartbeat channels and synchronize their application start and stop procedures. In particular, when an application failover is triggered by a software failure or a manual operation, the stop script is executed on the primary server first, and only then is the start script executed on the secondary server. Thus, the replicated data on the secondary server is in a safe state corresponding to a clean stop of the application.
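The ordering described above can be illustrated with a minimal Python sketch. It is only a model of the sequence, not SafeKit code: the Server class, the controlled_failover function and the final role swap are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical model of one cluster node (not a SafeKit API)."""
    name: str
    state: str                 # "PRIM" or "SECOND", as in the text
    app_running: bool = False


def controlled_failover(primary, secondary, events):
    """Move the application from the primary to the secondary in the safe order."""
    # 1. The stop script runs first on the primary, so the application
    #    closes cleanly and the replicated data reflects a clean stop.
    primary.app_running = False
    events.append(f"stop on {primary.name}")

    # 2. Only then does the start script run on the secondary.
    secondary.app_running = True
    events.append(f"start on {secondary.name}")

    # 3. Roles are swapped (assumed simplification for this sketch).
    primary.state, secondary.state = "SECOND", "PRIM"
    return events


server1 = Server("server1", "PRIM", app_running=True)
server2 = Server("server2", "SECOND")
events = controlled_failover(server1, server2, [])
print(events)
```

The point of the sketch is the order of the two events: the start on the secondary never precedes the stop on the primary.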

If all heartbeats are lost, the local server assumes that the other server is down and switches to the ALONE state. If the server that goes to the ALONE state is the SECOND server, an application failover takes place and the application is restarted on the secondary server. Although not mandatory, it is better to have two heartbeat channels on two different networks, so that the failure of a single network is not mistaken for a server failure.
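The decision rule in this paragraph can be sketched in Python as follows. The function name react_to_heartbeats and its return value are hypothetical, chosen for illustration only; the actual SafeKit implementation is not described here.

```python
def react_to_heartbeats(state, heartbeat_ok):
    """Return (new_state, restart_application) for the local server.

    state        -- "PRIM" or "SECOND"
    heartbeat_ok -- one boolean per heartbeat channel (one per shared network)
    """
    if any(heartbeat_ok):
        # At least one channel still carries heartbeats: the peer is alive,
        # so the failure of a single network changes nothing.
        return state, False

    # All heartbeats are lost: the peer is considered down.
    # Only a former SECOND has to restart the application (failover).
    restart_application = (state == "SECOND")
    return "ALONE", restart_application


print(react_to_heartbeats("SECOND", [True, False]))
print(react_to_heartbeats("SECOND", [False, False]))
print(react_to_heartbeats("PRIM", [False, False]))
```

With two channels, the first call shows why a second network helps: losing one channel leaves the states untouched, and only the loss of both is treated as a server failure.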

Cluster quorum problem when servers are in two remote computer rooms

Most often, an HA cluster securing a critical application in a data center is implemented with two servers in two geographically remote computer rooms, so that the application survives the loss of an entire room.

When a transient network isolation occurs between the two computer rooms, the split brain problem arises: both servers may start the critical application.

With a hardware failover cluster, this situation must not arise because a double execution means a concurrent access on shared disks and a potential corruption of the critical application data. That's why a cluster quorum is implemented with a third quorum server or a special quorum disk or even a remote hardware reset to avoid this concurrent execution of the critical application.

Unfortunately, these quorum devices add cost and complexity to the overall clustering architecture. Moreover, the system is not immune to an OS freeze: when the OS resumes from the freeze, there is a double execution of the application, even with the aforementioned mechanisms.

Simple cluster quorum with SafeKit

With the SafeKit HA software, the quorum within a Windows, Linux and AIX cluster requires no third quorum server, no quorum disk and no remote hardware reset. A simple splitbrain checker is sufficient for the SafeKit quorum to avoid the double execution of an application.

With no splitbrain checker, a SafeKit HA cluster tolerates a double execution of the critical application with no data corruption.
If there is an OS freeze or a network isolation and no splitbrain checker is configured for the quorum, the primary server continues to run the application in the ALONE state. The secondary server restarts the application and also goes to the ALONE state. The replicated directories are then isolated from each other, and each running application works on its own data in its own directory.

When the network is reconnected, a sacrifice must be made: the application is shut down on one of the two servers, and the data of that server is reintegrated from the primary one. After this reintegration, the data is once again in mirror mode between a primary and a secondary server.
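As a rough illustration of the isolation and reintegration described above, the following Python sketch models each replicated directory as a dictionary. The names (reintegrate, data_a, data_b, orders.db) are invented for the example, and which server is sacrificed is simply chosen by hand here; the text does not say how SafeKit makes that choice.

```python
import copy


def reintegrate(primary_data, sacrificed_data):
    """Replace the sacrificed server's data with a copy of the primary's."""
    sacrificed_data.clear()
    sacrificed_data.update(copy.deepcopy(primary_data))
    return sacrificed_data


# Mirror mode: both servers hold the same replicated directory.
mirror = {"orders.db": ["order-1"]}
data_a = copy.deepcopy(mirror)   # server A (stays primary in this sketch)
data_b = copy.deepcopy(mirror)   # server B (sacrificed in this sketch)

# Network isolation: both servers are ALONE and each application
# writes to its own copy, so the two copies diverge without corruption.
data_a["orders.db"].append("order-2 (written on A)")
data_b["orders.db"].append("order-3 (written on B)")
diverged = data_a != data_b

# Network reconnected: B's application is stopped and B is
# resynchronized from the primary, restoring mirror mode.
reintegrate(data_a, data_b)
print(diverged, data_b == data_a)
```

In this simplified model, the writes made on the sacrificed server during the isolation are replaced by the primary's copy, which is what "reintegration from the primary" implies.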

All these operations are automatic. The complexity of heartbeat, failover and quorum management within the cluster is integrated inside the SafeKit product and is transparent for its users. Thus, SafeKit can be deployed without specific skills on two standard servers in any configuration, local or remote. In addition, the configuration is the same for a Windows, Linux or AIX cluster.

Video to understand heartbeat, failover and quorum with the SafeKit software

This video was made with old versions of the software (Windows 2003 and SQL Server 2005), but SafeKit also works with current versions of Windows, Linux and AIX.
