File replication at byte level and application failover in a mirror cluster

A mirror cluster for critical database applications

A SafeKit mirror cluster with byte-level file replication provides a simple high availability solution for critical database applications. The SafeKit software implementing a mirror cluster runs on either Windows or Linux (including desktop editions of Windows). It implements synchronous, real-time, byte-level file replication. The resulting solution works like a cluster connected to a replicated mirror SAN, but without the cost and complexity of hardware clustering solutions.

A mirror cluster: file replication at byte level and failover

The mirror cluster is a primary-backup high availability solution. The application runs on a primary server and is restarted automatically on a secondary server if the primary server fails. Software data replication is configured at the file level, simply by naming the file directories to replicate. These directories can contain database files or flat files. With synchronous byte-level file replication, this architecture is particularly suited to providing high availability for back-end applications with critical data to protect against failure. SafeKit provides a generic mirror module on Windows and Linux to build a mirror cluster, as presented in the following video. You can write your own mirror module for your application. Microsoft SQL Server, MySQL, Oracle, PostgreSQL and Firebird are examples of mirror modules. From a mirror module, you can also replicate a full virtual machine with automatic failover inside a Hyper-V cluster. Note that this article explains the difference between VM HA and application HA.
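
To make the idea of a mirror module concrete, here is a minimal illustrative sketch in Python, with invented names, of what such a module essentially declares: the directories to replicate and the commands used to start and stop the application on failover. It is not SafeKit's actual module format.

```python
# Illustrative only: a hypothetical description of what a mirror module declares.
# SafeKit's real modules have their own configuration files; the names below are invented.
from dataclasses import dataclass
from typing import List

@dataclass
class MirrorModule:
    name: str                   # module name, e.g. "mysql"
    replicated_dirs: List[str]  # file directories replicated byte by byte
    start_cmd: str              # command run on the primary to start the application
    stop_cmd: str               # command run to stop the application before a swap or failover

# Example: a hypothetical module protecting a MySQL data directory.
mysql_module = MirrorModule(
    name="mysql",
    replicated_dirs=["/var/lib/mysql"],
    start_cmd="systemctl start mysql",
    stop_cmd="systemctl stop mysql",
)
```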

How does the SafeKit mirror cluster work?

Step 1. File replication at byte level in a mirror cluster

Figure: File replication at byte level in a mirror cluster

Server 1 (PRIM) runs the application. Users are connected to the virtual IP address of the mirror cluster. SafeKit replicates the files opened by the application in real time. Only the changes made by the application in the files are replicated across the network, thus limiting traffic (byte-level file replication). With software data replication at the file level, only the names of the file directories to replicate are configured in SafeKit. There are no prerequisites on disk organization for the two servers; the directories to replicate may even be located on the system disk. SafeKit implements synchronous replication with no data loss on failure, unlike asynchronous replication.
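
As an illustration of the principle only (this is not SafeKit's implementation), the following Python sketch shows what synchronous byte-level replication means: only the changed byte range is sent to the secondary server, and the write is not considered complete until the secondary has acknowledged it.

```python
# Conceptual sketch of synchronous byte-level replication (not SafeKit internals).
# Only the modified byte range is shipped to the secondary, and the operation
# returns only after the secondary has confirmed it, so no acknowledged data is
# lost if the primary fails immediately afterwards.
import socket
import struct

def replicate_write(sock: socket.socket, path: str, offset: int, data: bytes) -> None:
    """Send one write (file path, offset, changed bytes) and wait for the secondary's ack."""
    payload = path.encode()
    header = struct.pack("!IQI", len(payload), offset, len(data))
    sock.sendall(header + payload + data)
    ack = sock.recv(1)                      # blocking wait: this is what makes it synchronous
    if ack != b"\x01":
        raise IOError("secondary did not acknowledge the write")

def write_and_replicate(sock: socket.socket, path: str, offset: int, data: bytes) -> None:
    """Apply the write locally, then return only once the change also exists on the secondary."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
    replicate_write(sock, path, offset, data)
```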

Step 2. Failover

Figure: Failover in a mirror cluster

When Server 1 fails, Server 2 takes over. SafeKit switches the cluster's virtual IP address and restarts the application automatically on Server 2. The application finds the files replicated by SafeKit up to date on Server 2, thanks to the synchronous replication between Server 1 and Server 2. The application continues to run on Server 2 by locally modifying its files, which are no longer replicated to Server 1. The failover time is equal to the fault-detection time (set to 30 seconds by default) plus the application start-up time. Unlike disk replication solutions, there is no delay for remounting the file system and running file system recovery procedures.
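
The failover logic on the secondary server can be pictured with the following Python sketch (purely illustrative; the port, IP address and commands are hypothetical): when no heartbeat has been received for the detection timeout, the secondary takes over the virtual IP address and starts the application on its up-to-date copy of the files.

```python
# Conceptual sketch of failover on the secondary server (not SafeKit code).
import socket
import subprocess

DETECTION_TIMEOUT = 30.0   # seconds: default fault-detection time mentioned in the text
HEARTBEAT_PORT = 9999      # hypothetical heartbeat port

def wait_for_primary_failure() -> None:
    """Block until no heartbeat has been received for DETECTION_TIMEOUT seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(DETECTION_TIMEOUT)
    while True:
        try:
            sock.recv(64)                  # heartbeat received: the primary is still alive
        except socket.timeout:
            return                         # detection time elapsed: declare the primary failed

def failover() -> None:
    wait_for_primary_failure()
    # Hypothetical Linux commands: take over the virtual IP, then start the application locally.
    subprocess.run(["ip", "addr", "add", "10.0.0.100/24", "dev", "eth0"], check=True)
    subprocess.run(["systemctl", "start", "mysql"], check=True)
    # Total recovery time = detection time (30 s by default) + application start-up time.
```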

Step 3. Failback and reintegration

Figure: Failback in a mirror cluster

Failback involves restarting Server 1 after fixing the problem that caused it to fail. SafeKit automatically resynchronizes the files, updating only the files modified on Server 2 while Server 1 was halted. This reintegration takes place without disturbing the applications, which can continue running on Server 2. Automatic failback is a major feature that differentiates SafeKit from other solutions, which require you to manually stop the applications on Server 2 in order to resynchronize Server 1.

In order to optimize file reintegration, two cases are distinguished (see the sketch after this list):

  1. If SafeKit was cleanly stopped on Server 1, then at its restart only the modified zones of modified files are reintegrated, according to modification tracking bitmaps.
  2. If Server 1 crashed (power off) or was incorrectly stopped (exception in the replication process), the modification bitmaps are not reliable and are therefore discarded. All files bearing a modification timestamp more recent than the last known synchronization point, minus a grace delay (typically one hour), are reintegrated.
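
The following Python sketch (illustrative only, not SafeKit internals) summarizes how these two cases lead to different reintegration plans.

```python
# Conceptual sketch of the two reintegration cases described above.
import os
from typing import Dict, Iterable, Optional, Set

GRACE_DELAY = 3600.0   # the grace delay from case 2, typically one hour

def plan_reintegration(clean_stop: bool,
                       bitmaps: Dict[str, Set[int]],
                       replicated_files: Iterable[str],
                       last_sync_time: float) -> Dict[str, Optional[Set[int]]]:
    """Return, per file, the zones to re-copy (a set of zone indices),
    or None meaning the whole file must be re-copied."""
    if clean_stop:
        # Case 1: bitmaps are trustworthy; re-copy only the modified zones of modified files.
        return {path: zones for path, zones in bitmaps.items() if zones}
    # Case 2: crash or abnormal stop; bitmaps are discarded, fall back to timestamps.
    threshold = last_sync_time - GRACE_DELAY
    return {path: None
            for path in replicated_files
            if os.path.getmtime(path) > threshold}
```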

Step 4. Return to byte-level file replication in the mirror cluster

Figure: Passive-active mirror cluster with data replication

After reintegration, the files are once again in mirror mode, as in step 1. The system is back in high-availability mode, with the application running on Server 2 and SafeKit replicating data file updates to the backup Server 1. If the administrator wishes the application to run on Server 1, he/she can execute a "swap" command either manually at an appropriate time, or automatically through configuration.

Note that you can deploy several mirror modules on the same cluster and then implement an active-active cluster with crossed replication.
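
As an illustration with invented names, an active-active layout is simply two mirror modules whose preferred primary servers are crossed:

```python
# Illustrative only: crossed replication with two mirror modules (hypothetical names and fields).
crossed_modules = [
    {"module": "mysql",      "preferred_primary": "server1", "backup": "server2",
     "replicated_dirs": ["/var/lib/mysql"]},
    {"module": "postgresql", "preferred_primary": "server2", "backup": "server1",
     "replicated_dirs": ["/var/lib/postgresql"]},
]
# server1 replicates /var/lib/mysql to server2, while server2 replicates
# /var/lib/postgresql to server1: each server is primary for one module and
# backup for the other.
```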

Key differentiators of a file replication and failover cluster with Evidian SafeKit

Figure: Evidian SafeKit mirror cluster with real-time file replication and failover

Synchronous replication

Like  The real-time replication is synchronous with no data loss on failure

Dislike  This is not the case with asynchronous replication

Automatic failback

Like  After a server failure and a failover, the replication failback procedure is fully automatic on the failed server, without stopping the application on the only remaining server

Dislike  This is not the case with most replication solutions, particularly with replication at the database level. Manual operations are required to resynchronize a failed server. The application may even have to be stopped on the only remaining server during the resynchronization of the failed server

All clustering features

Like  The solution includes all clustering features: server failure monitoring, network failure monitoring, software failure monitoring, automatic application restart with a quick recovery time, and a virtual IP address switched in case of failure to automatically reroute clients. A clustering configuration is simply made by means of a high availability application module. There is no domain controller or Active Directory to configure on Windows. The solution works on Windows and Linux

Dislike  This is not the case with replication-only solutions like replication at the database level

Dislike  Quick application restart is not ensured with full virtual machine replication. In case of hypervisor failure, a full VM must be rebooted on a new hypervisor with an unknown recovery time

Replication of any type of data

Like  The replication works for databases but also for any other files that need to be replicated

Dislike  This is not the case for replication at the database level

File replication vs disk replication

Like  The replication is based on file directories that can be located anywhere (even on the system disk)

Dislike  This is not the case with disk replication, where a special application configuration is required to place the application data on a dedicated disk

File replication vs shared disk

Like  The servers can be put in two remote sites

Dislike  This is not the case with shared disk solutions

Remote sites

Like  With remote sites, the solution works with only two servers; for the quorum (network isolation), a simple split-brain checker is offered

Like  This is not the case for most clustering solutions, where a third server is required for the quorum

Like  If both servers are connected to the same IP network through an extended LAN between two remote sites, the SafeKit virtual IP address works with rerouting at level 2

Like  If both servers are connected to two different IP networks between two remote sites, the virtual IP address can be configured at the level of a load balancer. SafeKit offers a health check: the load balancer is configured with a URL managed by SafeKit, which returns OK on the primary server and NOT FOUND otherwise (see the sketch after this list). This solution is implemented for SafeKit in the cloud, but it can also be implemented with an on-premises load balancer

Like  A virtual IP address is not offered by replication-only solutions
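
The health check mentioned above can be pictured with this minimal Python sketch (illustrative only; the URL path and port are hypothetical and SafeKit's own implementation is not shown): the probed URL returns OK with status 200 on the primary server and 404 NOT FOUND on the backup, so the load balancer routes client traffic to the primary only.

```python
# Conceptual sketch of a load-balancer health check endpoint (not SafeKit's implementation).
from http.server import BaseHTTPRequestHandler, HTTPServer

IS_PRIMARY = True   # in a real cluster this would reflect the server's current role

class HealthCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and IS_PRIMARY:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")        # primary server: traffic is routed here
        else:
            self.send_response(404)        # backup server: "NOT FOUND", no traffic routed
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthCheckHandler).serve_forever()
```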

Uniform high availability solution

Like  SafeKit implements a mirror cluster with replication and failover. But it also implements a farm cluster with load balancing and failover. Thus an N-tier architecture can be made highly available and load balanced with the same solution on Windows and Linux (same installation, configuration and administration with the SafeKit console or with the command line interface). This is unique on the market

Dislike  This is not the case with an architecture mixing different technologies for load balancing, replication and failover
