Heartbeat, failover and quorum in a Windows or Linux cluster

How heartbeats and failover work in a cluster on Windows or Linux

The basic mechanism for synchronizing two servers and detecting server failures is the heartbeat, which is a monitoring data flow on a network shared by a pair of servers.

The SafeKit software supports as many heartbeats as there are networks shared by two servers. The heartbeat mechanism is used to implement Windows and Linux clusters. It is integrated within the SafeKit mirror cluster with real-time file replication and failover.

In normal operation, the two servers exchange their states (PRIM, SECOND, the resource states) through the heartbeat channels and synchronize their application start and stop procedures. In particular, in case of an application failover because of a software failure or a manual operation, the stop script which stops the application is first executed on the primary server, before executing the start script on the secondary server. Thus, replicated data on the secondary server are in a safe state corresponding to a clean stop of the application.

If all heartbeats are lost, it is interpreted as if the other server was down, and the local server switches to the ALONE state. If it is the SECOND server which goes to the ALONE state, then there is an application failover with restart of the application on the secondary server. Although not mandatory, it is better to have two heartbeat channels on two different networks for synchronizing the two servers in order to separate the network failure case from the server failure one.

Cluster quorum problem when servers are in two remote computer rooms

Most often, a HA cluster securing a critical application in a data center is implemented with two servers in two geographically remote computer rooms to support the disaster of a full room.

In situation of transient network isolation between both computer rooms, the split brain problem arises. Both servers may start the critical application.

With a hardware failover cluster, this situation must not arise because a double execution means a concurrent access on a shared storage and a potential corruption of the critical application data. That's why a cluster quorum is implemented with a third quorum server or a special quorum disk or even a remote hardware reset when possible to avoid this concurrent execution of the critical application.

Unfortunately this new quorum devices add cost and complexity to the overall clustering architecture. And the system is not immune to a freeze of an OS: when the OS resumes from the freeze, there are a double execution of the application, even with the aforementioned mechanisms and potentially with corruption of data on the shared storage.

Simple cluster quorum with SafeKit

With the SafeKit HA software, the quorum within a Windows or Linux cluster requires no third quorum server, no quorum disk and no remote hardware reset. A simple split brain checker is sufficient for the SafeKit quorum to avoid the double execution of an application.

The split brain checker, on the the loss of all heartbeats between servers, selects only one server to become the primary. The other server is not up-to-date anymore and goes into the WAIT state, until it receives the other server's heartbeats again, in which case it automatically resynchronizes the replicated data from the other server.

The primary server election is based on the ping of an IP address, called the witness. The network topology must be designed so that only one server can ping the witness in case of split brain. If this is not the case, both servers will go primary.

With no split brain checker, a SafeKit HA cluster supports a double execution of the critical application with no data corruption.
If there is an OS freeze or a network isolation with no split brain checker for the quorum, the primary server will continue to run the application in the ALONE state. And the secondary server will restart the application and will go also to the ALONE state. Replicated directories will be isolated and each running application will work on its own data in its own directory.

When the network is reconnected, a sacrifice must be made by shutting down the application on one of the two servers. This sacrifice shutdowns the application on one server and causes data reintegration from the primary one. After this reintegration, the data are once again in mirror mode between a primary and a secondary server.

All these operations are automatic. The complexity of the heartbeat, failover and quorum management within the cluster is integrated inside the SafeKit product and transparent for users of SafeKit. Thus, people deploying SafeKit without specific skill can do it on two standard servers in any configuration, local or remote. In addition, the configuration is the same for a Windows or Linux cluster.

Examples of cluster configuration on Windows and Linux with HA modules

Mirror modules



Microsoft SQL ServerWindows module-
OracleWindows moduleLinux module
MySQLWindows moduleLinux module
PostgreSQLWindows moduleLinux module
FirebirdWindows moduleLinux module
Hyper-VWindows module-
Milestone XProtect (based on Microsoft SQL Server)Windows module-
Hanwha SSM (based on PostgreSQL)Windows module-
Generic moduleWindows moduleLinux module

Farm modules



IIS moduleWindows module-
Apache moduleWindows moduleLinux module
Generic moduleWindows moduleLinux module

Other considerations: how to reconcile simplicity and high availability?


Evidian SafeKit Pricing

White Papers


To receive Evidian news, please fill the following form.