Cloud: The Simplest High Availability Cluster with Synchronous Replication and Failover

Evidian SafeKit provides a high availability cluster with real-time replication and failover in the Cloud. This article explains how to quickly implement such a cluster in the Cloud. A free trial is offered in the installation instructions section.

This clustering solution is recognized as the simplest to implement by our customers and partners. It is also a complete solution that addresses hardware failures (20% of problems), including the complete failure of a computer room; software failures (40% of problems), with software error detection and automatic restart; and human errors (40% of problems), thanks to its simplicity.

How the Evidian SafeKit software simply implements high availability with real-time synchronous replication and failover in the Cloud

How the Evidian SafeKit mirror cluster implements real-time replication and failover in the Cloud

In the figure above, server 1 (PRIM) runs the critical application. Users are connected to the virtual IP address of the mirror cluster. SafeKit replicates the files opened by the critical application in real time. Only changes within the files are replicated across the network, thus limiting traffic (byte-level file replication). The names of the file directories containing critical data are simply configured in SafeKit (see the userconfig.xml files at the end of this article). There are no prerequisites on disk organization for the two servers; the directories to replicate may even be located on the system disk. SafeKit implements synchronous replication with no data loss on failure, unlike asynchronous replication.

In case of server 1 failure, there is an automatic failover to server 2 with a restart of the critical application. Then, when server 1 is restarted, SafeKit implements an automatic failback with reintegration of the data, without stopping the critical application on server 2. Finally, the system returns to synchronous replication between server 2 and server 1. The administrator can decide to swap the roles of primary and secondary and return to server 1 running the critical application. The swap can also be made automatic by configuration.

Mirror cluster in the Cloud: installation on existing Cloud virtual machines (Windows or Linux)

Configuration of the Cloud load balancer

The load balancer must be configured to periodically send health packets to the virtual machines. For that, SafeKit provides a health check which runs inside the virtual machines and which answers OK to the load balancer only on the server where the module is running as primary. The load balancer therefore directs the virtual IP traffic to the primary server only.

You must configure the Cloud load balancer with the SafeKit health check (protocol, port and path).

For more information, see the configuration of the Cloud load balancer.
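
As a quick check, you can probe the health check yourself from a shell on each virtual machine. The lines below are only a sketch: the URL is a placeholder, because the exact protocol, port and path are given in the configuration of the Cloud load balancer.

# Probe the SafeKit health check (placeholder URL: see the load balancer guide)
curl -s -o /dev/null -w "%{http_code}\n" "http://<IP of VM>:<health check port>/<health check path>"
# Expected: a success code on the PRIM node only; the SECOND node must not answer OK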

Configuration of the Cloud network security

The network security must be configured to enable communications between the two virtual machines for the protocols and ports used by SafeKit (heartbeats, replication, web console).

Package installation on Windows

On both Windows servers

Package installation on Linux

On both Linux servers

Configuration instructions

The configuration is presented with the web console connected to 2 Windows servers, but it is the same with 2 Linux servers.

Important: all the configuration is made from a single browser.

It is recommended to configure the web console in HTTPS mode by connecting to https://<IP address of 1 VM>:9453 (next image). In this case, you must first configure HTTPS mode with the wizard described in the User's Guide: see "11.1 HTTPS Quick Configuration with the Configuration Wizard".

Start the SafeKit web console in HTTPS mode for configuration

Or you can use the web console in HTTP mode by connecting to http://<IP address of 1 VM>:9010 (next image).

Start the SafeKit web console in HTTP mode for configuration

Note that you can also make the configuration with DNS names, especially if the IP addresses are not static.

Enter the IP address of the first node and click on Confirm (next image)

SafeKit web console - first node in the cluster

Click on New node and enter the IP address of the second node (next image)

SafeKit web console - second node in the cluster

Click on the red floppy disk to save the configuration (previous image)

In the Configuration tab, click on mirror.safe, then enter mirror as the module name and click on Confirm: the next images show mirror instead of xxx

SafeKit web console - start the configuration of the module
SafeKit web console - enter the module name

Click on Validate (next image)

SafeKit web console - enter the module nodes

Change the path of replicated directories only if necessary (next image).

Do not configure a virtual IP address (next image): this configuration is already made in the Cloud load balancer. This section is useful for on-premises configurations only.

If a process is defined in the Process Checker section (next image), it will be monitored on the primary server with the restart action in case of failure. The services will be stopped and restarted locally on the primary server if this process disappears from the list of running processes. After 3 unsuccessful local restarts, the module is stopped on the local server and there is a failover to the secondary server. As a consequence, the health check answers OK to the Cloud load balancer on the new primary server and the virtual IP traffic is switched to it. (This behavior corresponds to the <errd> section of userconfig.xml shown at the end of this article.)

start_prim and stop_prim (next image) contain the start and the stop of services.

SafeKit web console - enter the module parameters

Click on Validate (previous image)

SafeKit web console - stop the module before applying the configuration

Click on Configure (previous image)

SafeKit web console - check the green success message of the configuration

Check the green success message on both servers and click on Next (previous image). On Linux, you may get an error at this step if the replicated directories are mount points. See this article to solve the problem.

SafeKit web console - select the node with the up-to-date database

Select the node with the most up-to-date replicated directories and click on Start it to run the first resynchronization in the right direction (previous image). Before starting the cluster, we suggest making a copy of the replicated directories to avoid any errors.

SafeKit web console - the first node starts as primary and is alone

Start the second node (previous image); it becomes SECOND green (next image) after the resynchronization of all the replicated directories (binary copy from node 1 to node 2).

SafeKit web console - the second node starts as SECOND

The cluster is operational with services running on the PRIM node and nothing running on the SECOND node (previous image). In this state, only modifications inside files are replicated in real time.
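
The module can also be monitored and controlled from a shell on each server with the safekit command. The lines below are a sketch, assuming a module named mirror; check the exact syntax in the SafeKit User's Guide.

safekit state -m mirror    # display the state of the module (PRIM, SECOND, ALONE, STOP...)
safekit start -m mirror    # start the module on this server
safekit stop -m mirror     # stop the module on this server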

Be careful: components that are clients of the services must be configured with the virtual IP address. The configuration can be made with a DNS name (if a DNS name has been created and associated with the virtual IP address).

Tests

Check with the Microsoft Management Console (MMC) on Windows or with command lines on Linux that the services are started on the primary server and stopped on the secondary server.

Stop the PRIM node by scrolling down the menu of the primary node and clicking on Stop. Check that there is a failover to the SECOND node. Then check the failover of the services with the Microsoft Management Console (MMC) on Windows or with command lines on Linux.
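
The same test can be run from a shell on the servers; a sketch, assuming the safekit command, a module named mirror and a hypothetical Linux service name:

# On the PRIM node: stop the module to force a failover
safekit stop -m mirror
# On the other node (now ALONE): the services must be running
systemctl status myservice
# Restart the stopped node: it reintegrates its data and becomes SECOND
safekit start -m mirror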

To understand what happens in the cluster, check the SafeKit logs of the primary server and the secondary server.

To see the module log of the primary server (next image):

SafeKit web console - Module Log of the PRIM server

To see the application log of the primary server (next image):

SafeKit web console - Application Log of the PRIM server

To see the logs of the secondary server (previous image), click on W12R2server75/SECOND (it will turn blue) on the left side and repeat the same operations. In the secondary module log, you will find the volume and the reintegration time of the replicated data.

Advanced configuration

In the Advanced Configuration tab (next image), you can edit the internal files of the module: bin/start_prim, bin/stop_prim and conf/userconfig.xml (next image on the left side). If you change the internal files here, you must apply the new configuration with a right click on the blue icon/xxx on the left side (next image): the interface will let you redeploy the modified files on both servers.

SafeKit web console - Advanced configuration of the module

Configure boot start (next image on the right side) sets the automatic start of the module when the server boots. Do this configuration on both servers once the high availability solution is correctly running.

SafeKit web console - Automatic boot of the module
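
Boot start can also be set from a shell on each server; a sketch, assuming a module named mirror (check the exact syntax in the User's Guide):

safekit boot -m mirror on     # start the module automatically when the server boots
safekit boot -m mirror off    # disable the automatic start at boot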

Support

To get support on the call desk of https://support.evidian.com, take 2 snapshots (2 .zip files), one for each server, and upload them in the call desk tool (next image).

SafeKit web console - snapshots for support

Internal files of the Windows mirror.safe module

userconfig.xml

<!DOCTYPE safe>
<safe>
   <service mode="mirror" defaultprim="alone" maxloop="3" loop_interval="24" failover="on">
      <!-- Server Configuration -->
      <!-- Names or IP addresses on the default network are set during initialization in the console -->
      <heart pulse="700" timeout="30000">
         <heartbeat name="default" ident="flow"/>
      </heart>
      <!-- Software Error Detection Configuration -->
      <!-- Replace
         * PROCESS_NAME by the name of the process to monitor
      -->
      <errd polltimer="10">
        <proc name="PROCESS_NAME" atleast="1" action="restart" class="prim" />
      </errd>
      <!-- File Replication Configuration -->
      <rfs async="second" acl="off" nbrei="3">
         <replicated dir="c:\test1replicated" mode="read_only"/>
         <replicated dir="c:\test2replicated" mode="read_only"/>
      </rfs>
      <!-- User scripts activation -->
      <user nicestoptimeout="300" forcestoptimeout="300" logging="userlog"/>
   </service>
</safe>

start_prim.cmd

@echo off

rem Script called on the primary server for starting application services

rem For logging into SafeKit log use:
rem "%SAFE%\safekit" printi | printe "message"

rem stdout goes into Application log
echo "Running start_prim %*"

set res=0

rem Fill with your services start call
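rem Example (hypothetical service name, replace with your own):
rem net start "MyService"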

set res=%errorlevel%

if %res% == 0 goto end

:stop
"%SAFE%\safekit" printe "start_prim failed"

rem uncomment to stop SafeKit when critical
rem "%SAFE%\safekit" stop -i "start_prim"

:end

stop_prim.cmd

@echo off

rem Script called on the primary server for stopping application services

rem For logging into SafeKit log use:
rem "%SAFE%\safekit" printi | printe "message"

rem ----------------------------------------------------------
rem
rem 2 stop modes:
rem
rem - graceful stop
rem call standard application stop with net stop
rem
rem - force stop (%1=force)
rem kill application's processes
rem
rem ----------------------------------------------------------

rem stdout goes into Application log
echo "Running stop_prim %*"

set res=0

rem default: no action on forcestop
if "%1" == "force" goto end

rem Fill with your services stop call
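rem Example (hypothetical service name, replace with your own):
rem net stop "MyService"
rem set res=%errorlevel%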

rem If necessary, uncomment to wait for the stop of the services
rem "%SAFEBIN%\sleep" 10

if %res% == 0 goto end

"%SAFE%\safekit" printe "stop_prim failed"

:end

Internal files of the Linux mirror.safe module

userconfig.xml

<!DOCTYPE safe>
<safe>
   <service mode="mirror" defaultprim="alone" maxloop="3" loop_interval="24" failover="on">
      <!-- Server Configuration -->
      <!-- Names or IP addresses on the default network are set during initialization in the console -->
      <heart pulse="700" timeout="30000">
         <heartbeat name="default" ident="flow"/>
      </heart>
      <!-- Software Error Detection Configuration -->
      <!-- Replace
         * PROCESS_NAME by the name of the process to monitor
      -->
      <errd polltimer="10">
        <proc name="PROCESS_NAME" atleast="1" action="restart" class="prim" />
      </errd>
      <!-- File Replication Configuration -->
      <rfs mountover="off" async="second" acl="off" nbrei="3" >
         <replicated dir="/test1replicated" mode="read_only"/>
         <replicated dir="/test2replicated" mode="read_only"/>
      </rfs>
      <!-- User scripts activation -->
      <user nicestoptimeout="300" forcestoptimeout="300" logging="userlog"/>
   </service>
</safe>

start_prim

#!/bin/sh
# Script called on the primary server for starting application

# For logging into SafeKit log use:
# $SAFE/safekit printi | printe "message" 

# stdout goes into Application log
echo "Running start_prim $*" 

res=0

# Fill with your application start call
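# Example (hypothetical service name, replace with your own):
# systemctl start myservice
# res=$?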

if [ $res -ne 0 ] ; then
  $SAFE/safekit printe "start_prim failed"

  # uncomment to stop SafeKit when critical
  # $SAFE/safekit stop -i "start_prim"
fi

stop_prim

#!/bin/sh
# Script called on the primary server for stopping application

# For logging into SafeKit log use:
# $SAFE/safekit printi | printe "message" 

#----------------------------------------------------------
#
# 2 stop modes:
#
# - graceful stop
#   call standard application stop
#
# - force stop ($1=force)
#   kill application's processes
#
#----------------------------------------------------------

# stdout goes into Application log
echo "Running stop_prim $*" 

res=0

# default: no action on forcestop
[ "$1" = "force" ] && exit 0

# Fill with your application stop call
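# Example (hypothetical service name, replace with your own):
# systemctl stop myservice
# res=$?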

[ $res -ne 0 ] && $SAFE/safekit printe "stop_prim failed"