Heartbeat, failover and quorum in a Windows or Linux cluster

Heartbeat, failover and quorum in a Windows or Linux cluster

How heartbeats and failover work in a cluster on Windows or Linux

The basic mechanism for synchronizing two servers and detecting server failures is the heartbeat, which is a monitoring data flow on a network shared by a pair of servers.

The SafeKit software supports as many heartbeats as there are networks shared by two servers. The heartbeat mechanism is used to implement Windows and Linux clusters. It is integrated within the SafeKit mirror cluster with real-time file replication and failover.

In normal operation, the two servers exchange their states (PRIM, SECOND, the resource states) through the heartbeat channels and synchronize their application start and stop procedures. In particular, in case of an application failover because of a software failure or a manual operation, the stop script which stops the application is first executed on the primary server, before executing the start script on the secondary server. Thus, replicated data on the secondary server are in a safe state corresponding to a clean stop of the application.

If all heartbeats are lost, it is interpreted as if the other server was down, and the local server switches to the ALONE state. If it is the SECOND server which goes to the ALONE state, then there is an application failover with restart of the application on the secondary server. Although not mandatory, it is better to have two heartbeat channels on two different networks for synchronizing the two servers in order to separate the network failure case from the server failure one.

Cluster quorum problem when servers are in two remote computer rooms

Most often, a HA cluster securing a critical application in a data center is implemented with two servers in two geographically remote computer rooms to support the disaster of a full room.

In situation of transient network isolation between both computer rooms, the split brain problem arises. Both servers may start the critical application.

With a hardware failover cluster, this situation must not arise because a double execution means a concurrent access on a shared storage and a potential corruption of the critical application data. That's why a cluster quorum is implemented with a third quorum server or a special quorum disk or even a remote hardware reset when possible to avoid this concurrent execution of the critical application.

Unfortunately this new quorum devices add cost and complexity to the overall clustering architecture. And the system is not immune to a freeze of an OS: when the OS resumes from the freeze, there are a double execution of the application, even with the aforementioned mechanisms and potentially with corruption of data on the shared storage.

Simple cluster quorum with SafeKit

With the SafeKit HA software, the quorum within a Windows or Linux cluster requires no third quorum server, no quorum disk and no remote hardware reset. A simple split brain checker is sufficient for the SafeKit quorum to avoid the double execution of an application.

The split brain checker, on the the loss of all heartbeats between servers, selects only one server to become the primary. The other server is not up-to-date anymore and goes into the WAIT state, until it receives the other server's heartbeats again, in which case it automatically resynchronizes the replicated data from the other server.

The primary server election is based on the ping of an IP address, called the witness. The witness is typically a router that does not crash. In case of network isolation, only the server with access to the witness will be primary ALONE, the other will go to WAIT. The witness is not tested permanently but only when the system switches over. If at the time of failover, the witness is down, the cluster goes into the WAIT-WAIT state and an administrator can choose to restart one of the nodes as primary through the SafeKit web interface.

Consider the critical case of an OS freeze or a network isolation without a split brain checker configured. A SafeKit high-availability cluster supports dual execution of a critical application without data corruption. In this case, the primary server continues to run the application in the ALONE state. And the secondary server restarts the application and also goes into the ALONE state. The replicated directories are isolated and each application is working on its own data in its own directory.

When the network is reconnected, a sacrifice must be made by shutting down the application on one of the two servers. This sacrifice shutdowns the application on one server and causes data reintegration from the primary one. After this reintegration, the data are once again in mirror mode between a primary and a secondary server.

All these operations are automatic. The complexity of the heartbeat, failover and quorum management within the cluster is integrated inside the SafeKit product and transparent for users of SafeKit. Thus, people deploying SafeKit without specific skill can do it on two standard servers in any configuration, local or remote. In addition, the configuration is the same for a Windows or a Linux cluster.

Important: if you choose another solution based on a shared or replicated disk, make sure that after an OS freeze, the server that comes out of the freeze can no longer access the shared or replicated disk, because two servers accessing the same disk via its file system leads to data corruption.

FAQ on Evidian SafeKit

Best use cases [+]

Customers [+]

Mirror cluster: demonstration, advantages [+]

Farm cluster: demonstration, advantages [+]

Application high availability modules [+]

Cloud solutions [+]

Comparison with other solutions [+]

SafeKit Webinar [+]

Pricing - Free trial [+]

contact
CONTACT
Demonstration

Evidian SafeKit Pricing





White Papers

NEWS

To receive Evidian news, please fill the following form.