Cloud: The Simplest High Availability Cluster with Synchronous Real-Time Replication and Failover
Evidian SafeKit
The solution in Cloud
Evidian SafeKit brings high availability in the Cloud between two redundant Windows or Linux servers.
This article explains how to quickly implement a Cloud cluster without shared disks and without specific skills.
A generic product
Note that SafeKit is a generic product on Windows and Linux.
With the same product, you can implement real-time replication and failover of any file directory and service, databases, complete Hyper-V or KVM virtual machines, Docker, Kubernetes and Cloud applications.
Architecture
How does it work in the Cloud?
- The servers are running in different availability zones.
- The critical application is running on the PRIM server.
- Users are connected to a primary/secondary virtual IP address which is configured in the Cloud load balancer.
- SafeKit provides a generic health check for the load balancer.
On the PRIM server, the health check returns OK to the load balancer and NOK on the SECOND server.
- In each server, SafeKit monitors the critical application with process checkers and custom checkers.
- SafeKit automatically restarts the critical application when there is a software or hardware failure, thanks to restart scripts.
- SafeKit makes synchronous real-time replication of files containing critical data.
- A connector for the SafeKit web console is installed in each server.
Thus, the high availability cluster can be managed in a very simple way to avoid human errors.
Partners, the success with SafeKit
This platform-agnostic solution is ideal for a partner reselling a critical application who wants to provide an easy-to-deploy redundancy and high availability option to many customers.
With many references won by partners in many countries, SafeKit has proven to be the easiest solution to implement for the redundancy and high availability of building management, video management, access control and SCADA software...
Building Management Software (BMS)
Video Management Software (VMS)
Electronic Access Control Software (EACS)
SCADA Software (Industry)
Discover SafeKit in Google GCP
Manual installation in Cloud of a high availability cluster with synchronous replication and failover (Windows or Linux)
- Configuration of the Cloud load balancer
- Configuration of the Cloud network security
- Package installation on Windows
- Package installation on Linux
- Configuration of SafeKit
- Tests
Configuration of the Cloud load balancer
The load balancer must be configured to periodically send health check requests to the virtual machines. For that, SafeKit provides a health check which runs inside the virtual machines and which
- returns OK when the mirror module state is PRIM (green) or ALONE (green)
- returns NOT FOUND in all other states
You must configure the Cloud load balancer with:
- HTTP protocol
- port 9010, the SafeKit web server port
- URL /var/modules/mirror/ready.txt (if mirror is the module name that you will deploy later)
For more information, see the configuration of the Cloud load balancer.
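Once the mirror module is deployed (see the configuration steps below), you can verify the health check by hand from any machine that can reach the nodes. This is a minimal sketch, assuming a hypothetical node address 10.0.0.4 and the default http port 9010:
# query the SafeKit health check used by the load balancer (the node IP is an example)
curl -i http://10.0.0.4:9010/var/modules/mirror/ready.txt
# expect an HTTP 200 answer on the PRIM (or ALONE) node and 404 (NOT FOUND) on the other node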
Configuration of the Cloud network security
The network security must be configured to enable communications for the following protocols and ports:
- UDP - 4800 for the safeadmin service (between SafeKit nodes)
- UDP - 8888 for the module heartbeat (between SafeKit nodes)
- TCP - 5600 for the module real-time file replication (between SafeKit nodes)
- TCP - 9010 for the load-balancer health check and for the SafeKit web console running in the http mode
- TCP - 9001 to configure the https mode for the console
- TCP - 9453 for the SafeKit web console running in https mode
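As an illustration only, here is a sketch of how these rules could be opened with the Google Cloud CLI; the rule name, network and source range are assumptions to adapt to your environment, and the source ranges used by your load balancer health checks must also be allowed on TCP 9010:
# hypothetical firewall rule opening the SafeKit ports between the nodes
gcloud compute firewall-rules create safekit-cluster \
  --network=default \
  --source-ranges=10.0.0.0/24 \
  --allow=udp:4800,udp:8888,tcp:5600,tcp:9010,tcp:9001,tcp:9453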
Package installation on Windows
On both Windows servers
- Install the free version of SafeKit for Cloud (click here) on 2 Windows nodes
- The module mirror.safe is delivered inside the package.
- To open the firewall, start a command line as administrator, go to C:\safekit\private\bin and run .\firewallcfg.cmd add on both nodes
Package installation on Linux
On both Linux servers
- Install the free version of SafeKit for Cloud (click here) on 2 Linux nodes
- After downloading the safekit_xx.bin package, execute it to extract the rpm and the safekitinstall script, and then execute the safekitinstall script (see the sketch after this list)
- Answer yes to firewall automatic configuration
- The module mirror.safe is delivered inside the package.
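For illustration, the Linux installation typically looks like the following shell session (a sketch; the exact package file name depends on the version you download):
chmod +x safekit_xx.bin
./safekit_xx.bin        # extracts the rpm and the safekitinstall script
./safekitinstall        # answer yes to the firewall automatic configuration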
Configuration of SafeKit
The configuration is presented with the web console connected to 2 Windows servers, but it is the same with 2 Linux servers.
Important: all the configuration is made from a single browser.
It is recommended to use the web console in https mode by connecting to https://<IP address of 1 VM>:9453 (next image). In this case, you must first configure the https mode with the wizard described in the User's Guide: see "11. Securing the SafeKit web console".
Or you can use the web console in http mode by connecting to http://<IP address of 1 VM>:9010 (next image).
Note that you can also make a configuration with DNS names, especially if the IP addresses are not static.
Enter the IP address of the first node and click on Confirm (next image)
Click on New node and enter the IP address of the second node (next image)
Click on the red floppy disk to save the configuration (previous image)
In the Configuration tab, click on mirror.safe, then enter mirror as the module name and Confirm (the next images show mirror instead of xxx)
Click on Validate (next image)
Change the path of replicated directories only if necessary (next image).
Do not configure a virtual IP address (next image) because this configuration is already made in the Cloud load balancer. This section is useful for on-premise configuration only.
If a process is defined in the Process Checker section (next image), it will be monitored on the primary server with the restart action in case of failure. The services will be stopped and restarted locally on the primary server if this process disappears from the list of running processes. After 3 unsuccessful local restarts, the module is stopped on the local server and there is a failover to the secondary server. As a consequence, the health check answers OK on the new primary server to the Cloud load balancer and the virtual IP address traffic is switched to the new primary server.
start_prim and stop_prim (next image) contain the start and the stop of services.
Note:
- on Windows, set services to Boot Startup Type = Manual on both servers (SafeKit controls the start of services in start_prim).
Click on Validate (previous image)
Click on Configure (previous image)
Check the green success message on both servers and click on Next (previous image). On Linux, you may have an error at this step if the replicated directories are mount points. See this article to solve the problem.
Select the node with the most up-to-date replicated directories and start it to make the first resynchronization in the right direction (previous image). Before this operation, we suggest making a copy of the replicated directories to avoid any errors.
Start the second node (previous image), which becomes SECOND (green, next image) after the resynchronization of all replicated directories (binary copy from node 1 to node 2).
The cluster is operational with services running on the PRIM node and nothing running on the SECOND node (previous image). Only modifications inside files are replicated in real-time in this state.
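If you want to check by hand that both nodes hold the same data after the first resynchronization, a simple sketch on Linux is to compare a checksum of a replicated directory on both nodes (assuming the /test1replicated directory of the example module):
# run on both nodes and compare the resulting checksums
find /test1replicated -type f -exec md5sum {} + | sort | md5sum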
Be careful: components that are clients of the services must be configured with the virtual IP address. The configuration can be made with a DNS name (if a DNS name has been created and associated with the virtual IP address).
Tests
Check with Windows Microsoft Management Console (MMC) or with Linux command lines that the services are started on the primary server and stopped on the secondary server.
Stop the PRIM node by scrolling down the menu of the primary node and clicking on Stop. Check that there is a failover to the SECOND node. Then check the failover of the services with the Windows Microsoft Management Console (MMC) or with Linux command lines.
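For example, on Linux, a quick way to check the state of the services on each node is sketched below (assuming a hypothetical systemd service named myservice started and stopped by start_prim/stop_prim):
# before the test: active on the primary node, inactive on the secondary node
systemctl is-active myservice
# after stopping the PRIM node: the service should become active on the new primary node
systemctl is-active myservice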
More information on tests in the User's Guide
Automatic start of the module at boot
Configure boot start (next image on the right side) configures the automatic start of the module when the server boots. Do this configuration on both servers once the high availability solution is correctly running.
Note that on Windows, with the Windows services manager, we assume that services are set to Boot Startup Type = Manual on both nodes. SafeKit controls the start of services in start_prim when starting the module.
Note that for synchronizing SafeKit at boot and at shutdown on Windows, we assume that a command line has been run as administrator on both nodes during installation: .\addStartupShutdown.cmd in C:\safekit\private\bin (otherwise do it now).
For reading the SafeKit logs, go to the Troubleshooting tab
For editing userconfig.xml, start_prim and stop_prim, go to the Advanced Configuration tab
Troubleshooting with the SafeKit module and application logs
Module log
Read the module log to understand the reasons for a failover, for a waiting state on the availability of a resource, etc.
To see the module log of the primary server (next image):
- click on the Control tab
- click on node 1/PRIM (it becomes blue) on the left side to select the server
- click on Module Log
- click on the Refresh icon (green arrows) to update the console
- click on the floppy disk to save the module log in a .txt file and analyze it in a text editor
Repeat the same operation to see the module log of the secondary server.
Application log
Read the application log to see the output messages of the start_prim and stop_prim restart scripts.
To see the application log of the primary server (next image):
- click on the Control tab
- click on node 1/PRIM (it becomes blue) on the left side to select the server
- click on Application Log to see messages when starting and stopping services
- click on the Refresh icon (green arrows) to update the console
- click on the floppy disk to save the application log in a .txt file and analyze it in a text editor
Repeat the same operation to see the application log of the secondary server.
More information on troubleshooting in the User's Guide
For support, open the Support section
Advanced configuration of SafeKit for implementing the high availability cluster
In the Advanced Configuration tab (next image), you can edit the internal files of the module: bin/start_prim, bin/stop_prim and conf/userconfig.xml (next image on the left side). If you make changes in the internal files here, you must apply the new configuration by right-clicking on the module icon (xxx) on the left side (next image): the interface will allow you to redeploy the modified files on both servers.
Training and documentation here
For an example of userconfig.xml, start_prim and stop_prim, open the Internals section below
Support of SafeKit
To get support on the call desk at https://support.evidian.com, take 2 snapshots (2 .zip files), one for each server, and upload them in the call desk tool (next image).
Internals of a SafeKit / Cloud high availability cluster with synchronous replication and failover
Go to the Advanced Configuration tab to edit these files.
Internal files of the Windows mirror.safe module
userconfig.xml on Windows (description in the User's Guide)
<!DOCTYPE safe>
<safe>
<service mode="mirror" defaultprim="alone" maxloop="3" loop_interval="24" failover="on">
<!-- Server Configuration -->
<!-- Names or IP addresses on the default network are set during initialization in the console -->
<heart pulse="700" timeout="30000">
<heartbeat name="default" ident="flow"/>
</heart>
<!-- Software Error Detection Configuration -->
<!-- Replace
* PROCESS_NAME by the name of the process to monitor
-->
<errd polltimer="10">
<proc name="PROCESS_NAME" atleast="1" action="restart" class="prim" />
</errd>
<!-- File Replication Configuration -->
<rfs async="second" acl="off" nbrei="3">
<replicated dir="c:\test1replicated" mode="read_only"/>
<replicated dir="c:\test2replicated" mode="read_only"/>
</rfs>
<!-- User scripts activation -->
<user nicestoptimeout="300" forcestoptimeout="300" logging="userlog"/>
</service>
</safe>
start_prim.cmd on Windows
@echo off
rem Script called on the primary server for starting application services
rem For logging into SafeKit log use:
rem "%SAFE%\safekit" printi | printe "message"
rem stdout goes into Application log
echo "Running start_prim %*"
set res=0
rem Fill with your services start call
set res=%errorlevel%
if %res% == 0 goto end
:stop
"%SAFE%\safekit" printe "start_prim failed"
rem uncomment to stop SafeKit when critical
rem "%SAFE%\safekit" stop -i "start_prim"
:end
stop_prim.cmd on Windows
@echo off
rem Script called on the primary server for stopping application services
rem For logging into SafeKit log use:
rem "%SAFE%\safekit" printi | printe "message"
rem ----------------------------------------------------------
rem
rem 2 stop modes:
rem
rem - graceful stop
rem call standard application stop with net stop
rem
rem - force stop (%1=force)
rem kill application's processes
rem
rem ----------------------------------------------------------
rem stdout goes into Application log
echo "Running stop_prim %*"
set res=0
rem default: no action on forcestop
if "%1" == "force" goto end
rem Fill with your services stop call
rem If necessary, uncomment to wait for the stop of the services
rem "%SAFEBIN%\sleep" 10
if %res% == 0 goto end
"%SAFE%\safekit" printe "stop_prim failed"
:end
Internal files of the Linux mirror.safe module
userconfig.xml on Linux (description in the User's Guide)
<!DOCTYPE safe>
<safe>
<service mode="mirror" defaultprim="alone" maxloop="3" loop_interval="24" failover="on">
<!-- Server Configuration -->
<!-- Names or IP addresses on the default network are set during initialization in the console -->
<heart pulse="700" timeout="30000">
<heartbeat name="default" ident="flow"/>
</heart>
<!-- Software Error Detection Configuration -->
<!-- Replace
* PROCESS_NAME by the name of the process to monitor
-->
<errd polltimer="10">
<proc name="PROCESS_NAME" atleast="1" action="restart" class="prim" />
</errd>
<!-- File Replication Configuration -->
<rfs mountover="off" async="second" acl="off" nbrei="3" >
<replicated dir="/test1replicated" mode="read_only"/>
<replicated dir="/test2replicated" mode="read_only"/>
</rfs>
<!-- User scripts activation -->
<user nicestoptimeout="300" forcestoptimeout="300" logging="userlog"/>
</service>
</safe>
start_prim on Linux
#!/bin/sh
# Script called on the primary server for starting application
# For logging into SafeKit log use:
# $SAFE/safekit printi | printe "message"
# stdout goes into Application log
echo "Running start_prim $*"
res=0
# Fill with your application start call
if [ $res -ne 0 ] ; then
$SAFE/safekit printe "start_prim failed"
# uncomment to stop SafeKit when critical
# $SAFE/safekit stop -i "start_prim"
fi
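For illustration only, the "Fill with your application start call" section could be completed as follows for a hypothetical application controlled by an init script /etc/init.d/myapp (an assumption, not part of the delivered module):
# hypothetical application start; store its exit code in res so the script can report a failure
/etc/init.d/myapp start
res=$?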
stop_prim on Linux
#!/bin/sh
# Script called on the primary server for stopping application
# For logging into SafeKit log use:
# $SAFE/safekit printi | printe "message"
#----------------------------------------------------------
#
# 2 stop modes:
#
# - graceful stop
# call standard application stop
#
# - force stop ($1=force)
# kill application's processes
#
#----------------------------------------------------------
# stdout goes into Application log
echo "Running stop_prim $*"
res=0
# default: no action on forcestop
[ "$1" = "force" ] && exit 0
# Fill with your application stop call
[ $res -ne 0 ] && $SAFE/safekit printe "stop_prim failed"
Advanced clustering architectures
Several modules can be deployed on the same cluster. Thus, advanced clustering architectures can be implemented:
- the farm+mirror cluster built by deploying a farm module and a mirror module on the same cluster,
- the active/active cluster with replication built by deploying several mirror modules on 2 servers,
- the Hyper-V cluster or KVM cluster with real-time replication and failover of full virtual machines between 2 active hypervisors,
- the N-1 cluster built by deploying N mirror modules on N+1 servers.
Evidian SafeKit mirror cluster with real-time file replication and failover:
- 3 products in 1
- Very simple configuration
- Synchronous replication
- Fully automated failback
- Replication of any type of data
- File replication vs disk replication
- File replication vs shared disk
- Remote sites and virtual IP address
- Quorum and split brain
- Active/active cluster
- Uniform high availability solution
- RTO / RPO
Evidian SafeKit farm cluster with load balancing and failover:
- No load balancer or dedicated proxy servers or special multicast Ethernet address
- All clustering features
- Remote sites and virtual IP address
- Uniform high availability solution
Comparisons:
- Software clustering vs hardware clustering
- Shared nothing vs a shared disk cluster
- Application High Availability vs Full Virtual Machine High Availability
- High availability vs fault tolerance
- Synchronous replication vs asynchronous replication
- Byte-level file replication vs block-level disk replication
- Heartbeat, failover and quorum to avoid 2 master nodes
- Virtual IP address primary/secondary, network load balancing, failover