Microsoft Azure: The Simplest High Availability Cluster with Synchronous Real-Time Replication and Failover
Evidian SafeKit
The solution in Microsoft Azure
Evidian SafeKit brings high availability in Microsoft Azure between two redundant Windows or Linux servers.
This article explains how to quickly implement a Microsoft Azure cluster without shared disk and without specific skills.
A generic product
Note that SafeKit is a generic product on Windows and Linux.
With the same product, you can implement real-time replication and failover of any file directory and service, database, complete Hyper-V or KVM virtual machines, Docker, Kubernetes, and Cloud applications.
Architecture
How does it work in Microsoft Azure?
- The servers are running in different availability zones.
- The critical application is running on the PRIM server.
- Users are connected to a primary/secondary virtual IP address which is configured in the Microsoft Azure load balancer.
- SafeKit provides a generic health check for the load balancer. On the PRIM server, the health check returns OK to the load balancer; on the SECOND server, it returns NOK.
- On each server, SafeKit monitors the critical application with process checkers and custom checkers.
- SafeKit automatically restarts the critical application after a software or hardware failure, thanks to restart scripts.
- SafeKit performs synchronous real-time replication of the files containing critical data.
- A connector for the SafeKit web console is installed on each server.
The high availability cluster can thus be managed in a very simple way, avoiding human errors.
Partners, the success with SafeKit
This platform-agnostic solution is ideal for a partner reselling a critical application who wants to offer an easy-to-deploy redundancy and high availability option to many customers.
With many references won by partners in many countries, SafeKit has proven to be the easiest solution to implement for the redundancy and high availability of building management, video management, access control and SCADA software...
Building Management Software (BMS)
Video Management Software (VMS)
Electronic Access Control Software (EACS)
SCADA Software (Industry)
Azure quick start template for a one-click deployment of a mirror cluster on Windows or Linux
The Evidian SafeKit mirror cluster has been validated by Microsoft Azure in quickstart templates.
To deploy the Evidian SafeKit high availability cluster with replication and failover in Microsoft Azure between two redundant Windows or Linux servers, just click on the following button which deploys everything:
Azure quick start template for a mirror cluster (Windows or Linux) >
Go to the Template Guide tab for more information
Automatic installation in Microsoft Azure of a high availability cluster with synchronous replication and failover (Windows or Linux)
- Automatic deployment
- Configure deployment
- After deployment
- Video of the deployment
- Access VMs in Microsoft Azure
- Deployed resources
Automatic deployment of a Microsoft Azure template
To deploy the Evidian SafeKit high availability cluster with replication and failover in Microsoft Azure, just click on the following button which deploys everything:
Deploy to Azure a Mirror Cluster (Windows or Linux) >
Configure the Microsoft Azure template
After the click:
- in "Resource group", click on "Create new" and set a name
- choose the geographical "Location" where the cluster will be deployed
- choose the "OS" Windows or Linux
- choose an "Admin User" name (not Administrator, not root)
- choose an "Admin Password". Passwords must be between 12 and 72 characters and have 3 of the following: 1 lower case, 1 upper case, 1 number, and 1 special character.
- click on "I agree..." and then on "Purchase" (no fee on SafeKit free trial, only on Microsoft Azure infrastructure)
- wait the end of deployment of the real-time replication and failover cluster
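As an alternative to the portal form, the same quickstart template can also be deployed from the Azure CLI. The sketch below is only an illustration: the resource group name and location are examples, the template URL is the azuredeploy.json behind the "Deploy to Azure" button, and the parameters file mirrors the portal form fields (OS, Admin User, Admin Password).
# create a resource group for the cluster (name and location are examples)
az group create --name safekit-rg --location westeurope
# deploy the SafeKit mirror quickstart template
# <template URL> is the azuredeploy.json behind the "Deploy to Azure" button
# azuredeploy.parameters.json mirrors the portal form (OS, Admin User, Admin Password)
az deployment group create --resource-group safekit-rg --template-uri <template URL> --parameters @azuredeploy.parameters.json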
After deployment
After deployment, click on 'Microsoft.Template' (previous image), then go to the output panel and
- visit the credential URL to install the client and CA certificates in your web browser. Force the loading of the unsafe page. Use 'CA_admin' as the user and the password you entered during the template configuration. Be careful to put the second certificate in the 'Trusted Root Certification Authorities' store
- after installing the certificates, start the web console of the cluster
- test the primary/secondary virtual IP address with the test URL given in the output. A primary/secondary load balancing rule has been set for external port 9453 and internal port 9453. The URL returns the name of the PRIM or ALONE server (see the curl sketch below)
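To test the virtual IP address from a shell instead of a browser, you can call the test URL given in the deployment output; a minimal sketch (the exact URL comes from the output, -k accepts the self-signed certificate):
# replace <test URL> with the test URL shown in the deployment output (external port 9453)
curl -k "<test URL>"
# the reply contains the name of the PRIM or ALONE server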
Video of the Microsoft Azure mirror template deployment
Accessing the VMs through SSH (Linux) or remote desktop (Windows)
If you want to connect to the Virtual Machines through SSH (Linux) or remote desktop (Windows), you can use the SafeKit web console to find the IP addresses or DNS names of the VMs (next images). Use the user/password entered during the template configuration to access the VMs.
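For example, a Linux VM can be reached with SSH using the admin user defined during the template configuration (the address below is a placeholder taken from the web console or the Azure portal); for a Windows VM, use a Remote Desktop client with the same credentials.
# connect to a Linux VM with the admin user entered during the template configuration
ssh <admin user>@<public IP address or DNS name of the VM>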
Deployed resources
In terms of VMs, this template deploys:
- 2 VMs (Windows or Linux)
- each VM has a public IP address
- the SafeKit free trial is installed in both VMs
- a SafeKit mirror module is configured in both VMs
In terms of load balancer, this template deploys:
- a public load balancer
- a public IP is associated with the public load balancer and plays the role of the virtual IP
- both VMs are in the backend pool of the load balancer
- a health probe checks the mirror module state on both VMs
- a load balancing rule for external port 9453 / internal port 9453 is set to test the primary/secondary virtual IP
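To verify what has been created, the deployed resources can be listed with the Azure CLI (the resource group is the one chosen during the template configuration):
# list the resources deployed by the template
az resource list --resource-group <your resource group> --output table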
For a manual installation, go to the Manual Installation tab
Manual installation in Microsoft Azure of a high availability cluster with synchronous replication and failover (Windows or Linux)
- Configuration of the Microsoft Azure load balancer
- Configuration of the Microsoft Azure network security
- Package installation on Windows
- Package installation on Linux
- Configuration of SafeKit
- Tests
Configuration of the Microsoft Azure load balancer
The load balancer must be configured to periodically send health packets to the virtual machines. For that, SafeKit provides a health probe which runs inside the virtual machines and which
- returns OK when the mirror module state is PRIM (green) or ALONE (green)
- returns NOT FOUND in all other states
You must configure the Microsoft Azure load balancer with:
- HTTP protocol
- port 9010, the SafeKit web server port
- URL /var/modules/mirror/ready.txt (if mirror is the module name that you will deploy later)
For more information, see the configuration of the Microsoft Azure load balancer.
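As an illustration, the health probe described above could be created with the Azure CLI as follows; the resource group and load balancer names are hypothetical, and a load balancing rule referencing this probe must also be defined for the application ports:
# HTTP health probe on the SafeKit web server (port 9010, mirror module ready flag)
az network lb probe create --resource-group safekit-rg --lb-name safekit-lb --name safekit-health --protocol Http --port 9010 --path /var/modules/mirror/ready.txt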
Configuration of the Microsoft Azure network security
The network security must be configured to enable communications for the following protocols and ports (an Azure CLI sketch follows the list):
- UDP - 4800 for the safeadmin service (between SafeKit nodes)
- UDP - 8888 for the module heartbeat (between SafeKit nodes)
- TCP - 5600 for the module real-time file replication (between SafeKit nodes)
- TCP - 9010 for the load balancer health probe and for the SafeKit web console running in HTTP mode
- TCP - 9001 to configure the HTTPS mode for the console
- TCP - 9453 for the SafeKit web console running in HTTPS mode
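A hedged Azure CLI sketch for opening these ports, assuming a network security group named safekit-nsg in the resource group safekit-rg (rule names and priorities are arbitrary):
# UDP ports between SafeKit nodes (safeadmin service and module heartbeat)
az network nsg rule create --resource-group safekit-rg --nsg-name safekit-nsg --name safekit-udp --priority 100 --protocol Udp --destination-port-ranges 4800 8888 --access Allow
# TCP ports: replication, health probe / HTTP console, HTTPS setup, HTTPS console
az network nsg rule create --resource-group safekit-rg --nsg-name safekit-nsg --name safekit-tcp --priority 110 --protocol Tcp --destination-port-ranges 5600 9010 9001 9453 --access Allow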
Package installation on Windows
On both Windows servers
- Install the free version of SafeKit for Cloud (click here) on 2 Windows nodes
- The module mirror.safe is delivered inside the package.
- To open the firewall, start a command line as administrator, go to C:\safekit\private\bin and run .\firewallcfg.cmd add on both nodes
Package installation on Linux
On both Linux servers
- Install the free version of SafeKit for Cloud (click here) on 2 Linux nodes
- After downloading the safekit_xx.bin package, execute it to extract the rpm and the safekitinstall script, then run the safekitinstall script (as sketched after this list)
- Answer yes to firewall automatic configuration
- The module mirror.safe is delivered inside the package.
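A sketch of these installation steps from a shell, run as root on both Linux nodes (replace xx with the downloaded version):
# make the self-extracting package executable and run it to extract the rpm and safekitinstall
chmod +x safekit_xx.bin
./safekit_xx.bin
# run the installation script and answer yes to the firewall automatic configuration
./safekitinstall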
Configuration of SafeKit
The configuration is presented with the web console connected to 2 Windows servers, but it is the same with 2 Linux servers.
Important: all the configuration is made from a single browser.
It is recommended to configure the web console in HTTPS mode by connecting to https://<IP address of 1 VM>:9453 (next image). In this case, you must first configure the HTTPS mode by using the wizard described in the User's Guide: see "11. Securing the SafeKit web console".
Or you can use the web console in HTTP mode by connecting to http://<IP address of 1 VM>:9010 (next image).
Note that you can also make a configuration with DNS names, especially if the IP addresses are not static.
Enter the IP address of the first node and click on Confirm (next image)
Click on New node and enter the IP address of the second node (next image)
Click on the red floppy disk to save the configuration (previous image)
In the Configuration tab, click on mirror.safe, then enter mirror as the module name and Confirm (the next images show mirror instead of xxx)
Click on Validate (next image)
Change the path of replicated directories only if necessary (next image).
Do not configure a virtual IP address (next image) because this configuration is already made in the Microsoft Azure load balancer. This section is useful for on-premise configuration only.
If a process is defined in the Process Checker section (next image), it will be monitored on the primary server with the restart action in case of failure. The services will be stopped and restarted locally on the primary server if this process disappears from the list of running processes. After 3 unsuccessful local restarts, the module is stopped on the local server and there is a failover to the secondary server. As a consequence, the health probe answers OK to the Microsoft Azure load balancer on the new primary server and the virtual IP address traffic is switched to the new primary server.
start_prim and stop_prim (next image) contain the start and stop commands for the services.
Note:
- on Windows, set the services to Boot Startup Type = Manual on both servers (SafeKit controls the start of the services in start_prim).
Click on Validate (previous image)
Click on Configure (previous image)
Check the green success message on both servers and click on Next (previous image). On Linux, you may have an error at this step if the replicated directories are mount points. See this article to solve the problem.
Select the node with the most up-to-date replicated directories and click on start it to make the first resynchronization in the right direction (previous image). We suggest making a copy of the replicated directories before this operation to avoid any errors.
Start the second node (previous image), which becomes SECOND green (next image) after the resynchronization of all replicated directories (binary copy from node 1 to node 2).
The cluster is operational with services running on the PRIM node and nothing running on the SECOND node (previous image). Only modifications inside files are replicated in real-time in this state.
Be careful: components which are clients of the services must be configured with the virtual IP address. The configuration can be made with a DNS name (if a DNS name has been created and associated with the virtual IP address).
Tests
Check with Windows Microsoft Management Console (MMC) or with Linux command lines that the services are started on the primary server and stopped on the secondary server.
Stop the PRIM node by scrolling down the menu of the primary node and clicking on Stop. Check that there is a failover to the SECOND node. Then check the failover of the services with the Windows Microsoft Management Console (MMC) or with Linux command lines.
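On Linux, for example, the check before and after the failover could look like the sketch below, with a hypothetical service name myapp (on Windows, use the services MMC instead):
# on the primary node, the application service is expected to be active
systemctl is-active myapp
# on the secondary node, the same command is expected to report inactive
systemctl is-active myapp
# after stopping the PRIM node in the console, re-run both checks:
# the roles must be swapped (service active on the new primary only)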
More information on tests in the User's Guide
Automatic start of the module at boot
Configure boot start (next image, on the right side) configures the automatic start of the module when the server boots. Do this configuration on both servers once the high availability solution is correctly running.
Note that on Windows, in the Windows services manager, the services are assumed to have Boot Startup Type = Manual on both nodes. SafeKit controls the start of the services in start_prim when starting the module.
Note that for synchronizing SafeKit at boot and at shutdown on Windows, we assume that a command line has been run as administrator on both nodes during installation: .\addStartupShutdown.cmd in C:\safekit\private\bin (otherwise do it now).
For reading the SafeKit logs, go to the Troubleshooting tab
For editing userconfig.xml, start_prim and stop_prim, go to the Advanced Configuration tab
Troubleshooting with the SafeKit module and application logs
Module log
Read the module log to understand the reasons for a failover, for a waiting state on the availability of a resource, etc.
To see the module log of the primary server (next image):
- click on the Control tab
- click on node 1/PRIM (it becomes blue) on the left side to select the server
- click on Module Log
- click on the Refresh icon (green arrows) to update the console
- click on the floppy disk to save the module log in a .txt file and to analyze in a text editor
Repeat the same operation to see the module log of the secondary server.
Application log
Read the application log to see the output messages of the start_prim and stop_prim restart scripts.
To see the application log of the primary server (next image):
- click on the Control tab
- click on node 1/PRIM (it becomes blue) on the left side to select the server
- click on Application Log to see messages when starting and stopping services
- click on the Refresh icon (green arrows) to update the console
- click on the floppy disk to save the application log in a .txt file and to analyze in a text editor
Repeat the same operation to see the application log of the secondary server.
More information on troubleshooting in the User's Guide
For support, open the Support section
Advanced configuration of SafeKit for implementing the high availability cluster
In the Advanced Configuration tab (next image), you can edit the internal files of the module: bin/start_prim, bin/stop_prim and conf/userconfig.xml (next image on the left side). If you make changes in the internal files here, you must apply the new configuration by a right click on the icon/xxx on the left side (next image): the interface will allow you to redeploy the modified files on both servers.
Training and documentation here
For an example of userconfig.xml, start_prim and stop_prim, open the Internals section below
Support of SafeKit
To get support on the call desk of https://support.evidian.com, take 2 snapshots (2 .zip files), one for each server, and upload them in the call desk tool (next image).
Internals of a SafeKit / Microsoft Azure high availability cluster with synchronous replication and failover
Go to the Advanced Configuration tab for editing these files.
Internal files of the Windows mirror.safe module
userconfig.xml on Windows (description in the User's Guide)
<!DOCTYPE safe>
<safe>
<service mode="mirror" defaultprim="alone" maxloop="3" loop_interval="24" failover="on">
<!-- Server Configuration -->
<!-- Names or IP addresses on the default network are set during initialization in the console -->
<heart pulse="700" timeout="30000">
<heartbeat name="default" ident="flow"/>
</heart>
<!-- Software Error Detection Configuration -->
<!-- Replace
* PROCESS_NAME by the name of the process to monitor
-->
<errd polltimer="10">
<proc name="PROCESS_NAME" atleast="1" action="restart" class="prim" />
</errd>
<!-- File Replication Configuration -->
<rfs async="second" acl="off" nbrei="3">
<replicated dir="c:\test1replicated" mode="read_only"/>
<replicated dir="c:\test2replicated" mode="read_only"/>
</rfs>
<!-- User scripts activation -->
<user nicestoptimeout="300" forcestoptimeout="300" logging="userlog"/>
</service>
</safe>
start_prim.cmd on Windows
@echo off
rem Script called on the primary server for starting application services
rem For logging into SafeKit log use:
rem "%SAFE%\safekit" printi | printe "message"
rem stdout goes into Application log
echo "Running start_prim %*"
set res=0
rem Fill with your services start call
set res=%errorlevel%
if %res% == 0 goto end
:stop
"%SAFE%\safekit" printe "start_prim failed"
rem uncomment to stop SafeKit when critical
rem "%SAFE%\safekit" stop -i "start_prim"
:end
stop_prim.cmd on Windows
@echo off
rem Script called on the primary server for stopping application services
rem For logging into SafeKit log use:
rem "%SAFE%\safekit" printi | printe "message"
rem ----------------------------------------------------------
rem
rem 2 stop modes:
rem
rem - graceful stop
rem call standard application stop with net stop
rem
rem - force stop (%1=force)
rem kill application's processes
rem
rem ----------------------------------------------------------
rem stdout goes into Application log
echo "Running stop_prim %*"
set res=0
rem default: no action on forcestop
if "%1" == "force" goto end
rem Fill with your services stop call
rem If necessary, uncomment to wait for the stop of the services
rem "%SAFEBIN%\sleep" 10
if %res% == 0 goto end
"%SAFE%\safekit" printe "stop_prim failed"
:end
Internal files of the Linux mirror.safe module
userconfig.xml on Linux (description in the User's Guide)
<!DOCTYPE safe>
<safe>
<service mode="mirror" defaultprim="alone" maxloop="3" loop_interval="24" failover="on">
<!-- Server Configuration -->
<!-- Names or IP addresses on the default network are set during initialization in the console -->
<heart pulse="700" timeout="30000">
<heartbeat name="default" ident="flow"/>
</heart>
<!-- Software Error Detection Configuration -->
<!-- Replace
* PROCESS_NAME by the name of the process to monitor
-->
<errd polltimer="10">
<proc name="PROCESS_NAME" atleast="1" action="restart" class="prim" />
</errd>
<!-- File Replication Configuration -->
<rfs mountover="off" async="second" acl="off" nbrei="3" >
<replicated dir="/test1replicated" mode="read_only"/>
<replicated dir="/test2replicated" mode="read_only"/>
</rfs>
<!-- User scripts activation -->
<user nicestoptimeout="300" forcestoptimeout="300" logging="userlog"/>
</service>
</safe>
start_prim on Linux
#!/bin/sh
# Script called on the primary server for starting application
# For logging into SafeKit log use:
# $SAFE/safekit printi | printe "message"
# stdout goes into Application log
echo "Running start_prim $*"
res=0
# Fill with your application start call
if [ $res -ne 0 ] ; then
$SAFE/safekit printe "start_prim failed"
# uncomment to stop SafeKit when critical
# $SAFE/safekit stop -i "start_prim"
fi
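For example, the "Fill with your application start call" line could be completed as follows, assuming a hypothetical systemd service myapp.service that is disabled at boot (SafeKit starts it on the primary node only):
# hypothetical application start call inside start_prim
systemctl start myapp.service
res=$?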
stop_prim on Linux
#!/bin/sh
# Script called on the primary server for stopping application
# For logging into SafeKit log use:
# $SAFE/safekit printi | printe "message"
#----------------------------------------------------------
#
# 2 stop modes:
#
# - graceful stop
# call standard application stop
#
# - force stop ($1=force)
# kill application's processes
#
#----------------------------------------------------------
# stdout goes into Application log
echo "Running stop_prim $*"
res=0
# default: no action on forcestop
[ "$1" = "force" ] && exit 0
# Fill with your application stop call
[ $res -ne 0 ] && $SAFE/safekit printe "stop_prim failed"
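Similarly, the "Fill with your application stop call" line could be completed as follows for the same hypothetical myapp.service:
# hypothetical application stop call inside stop_prim
systemctl stop myapp.service
res=$?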
Advanced clustering architectures
Several modules can be deployed on the same cluster. Thus, advanced clustering architectures can be implemented:
- the farm+mirror cluster built by deploying a farm module and a mirror module on the same cluster,
- the active/active cluster with replication built by deploying several mirror modules on 2 servers,
- the Hyper-V cluster or KVM cluster with real-time replication and failover of full virtual machines between 2 active hypervisors,
- the N-1 cluster built by deploying N mirror modules on N+1 servers.
Evidian SafeKit mirror cluster with real-time file replication and failover
- 3 products in 1
- Very simple configuration
- Synchronous replication
- Fully automated failback
- Replication of any type of data
- File replication vs disk replication
- File replication vs shared disk
- Remote sites and virtual IP address
- Quorum and split brain
- Active/active cluster
- Uniform high availability solution
- RTO / RPO
Evidian SafeKit farm cluster with load balancing and failover
- No load balancer or dedicated proxy servers or special multicast Ethernet address
- All clustering features
- Remote sites and virtual IP address
- Uniform high availability solution
Software clustering vs hardware clustering
Shared nothing vs a shared disk cluster
Application High Availability vs Full Virtual Machine High Availability
High availability vs fault tolerance
Synchronous replication vs asynchronous replication
Byte-level file replication vs block-level disk replication
Heartbeat, failover and quorum to avoid 2 master nodes
Virtual IP address primary/secondary, network load balancing, failover
User's Guide
Application Modules
Release Notes
Presales documentation
Introduction
- Features
- Architectures
- Distinctive advantages
- Hardware vs software cluster
- Synchronous vs asynchronous replication
- File vs disk replication
- High availability vs fault tolerance
- Hardware vs software load balancing
- Virtual machine vs application HA
Installation, Console, CLI
- Install and setup / pptx
- Package installation
- Nodes setup
- Cluster configuration
- Upgrade
- Web console / pptx
- Cluster configuration
- Configuration tab
- Control tab
- Monitor tab
- Advanced Configuration tab
- Command line / pptx
- Silent installation
- Cluster administration
- Module administration
- Command line interface
Advanced configuration
- Mirror module / pptx
- userconfig.xml + restart scripts
- Heartbeat (<heartbeat>)
- Virtual IP address (<vip>)
- Real-time file replication (<rfs>)
- Farm module / pptx
- userconfig.xml + restart scripts
- Farm configuration (<farm>)
- Virtual IP address (<vip>)
- Checkers / pptx
- Failover machine (<failover>)
- Process monitoring (<errd>)
- Network and duplicate IP checkers
- Custom checker (<custom>)
- Split brain checker (<splitbrain>)
- TCP, ping, module checkers
Support
- Support tools / pptx
- Analyze snapshots
- Evidian support / pptx
- Get permanent license key
- Register on support.evidian.com
- Call desk