Amazon AWS: The Simplest High Availability Cluster with Synchronous Real-Time Replication and Failover

Evidian SafeKit

How does the Evidian SafeKit software simply implement a high availability cluster in Amazon AWS?

The solution in Amazon AWS

Evidian SafeKit brings high availability in Amazon AWS between two Windows or Linux redundant servers.

This article explains how to quickly implement an Amazon AWS cluster without a shared disk and without specific skills.

A generic product

Note that SafeKit is a generic product on Windows and Linux.

With the same product, you can implement real-time replication and failover of any file directory and service, database, complete Hyper-V or KVM virtual machines, Docker, Kubernetes, and Cloud applications.

Architecture

SafeKit mirror cluster with real-time replication and failover in Amazon AWS

How does it work in Amazon AWS?

  • The servers are running in different availability zones.
  • The critical application is running on the PRIM server.
  • Users are connected to a primary/secondary virtual IP address which is configured in the Amazon AWS load balancer.
  • SafeKit provides a generic health check for the load balancer.
    The health check returns OK to the load balancer on the PRIM server and NOK on the SECOND server.
  • In each server, SafeKit monitors the critical application with process checkers and custom checkers.
  • Thanks to restart scripts, SafeKit automatically restarts the critical application when there is a software or hardware failure.
  • SafeKit performs synchronous real-time replication of the files containing critical data.
  • A connector for the SafeKit web console is installed in each server.
    Thus, the high availability cluster can be managed in a very simple way to avoid human errors.
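
The health check behaviour described above can be illustrated with a minimal sketch. This is not SafeKit's own implementation: the function names, the plain-text bodies, and the mapping of OK/NOK to HTTP 200/503 are assumptions chosen for illustration. Only the PRIM and SECOND role names come from the text.

```python
from http.server import BaseHTTPRequestHandler


def health_status(role):
    """Return (HTTP status, body) as a load balancer probe would see it.

    Assumption: OK maps to 200 and NOK to 503, so the load balancer
    routes traffic only to the server currently in the PRIM role.
    """
    if role == "PRIM":
        return 200, "OK"
    return 503, "NOK"


def make_handler(get_role):
    """Build a request handler that answers probes from the current role.

    get_role is a hypothetical callable returning "PRIM" or "SECOND".
    """
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = health_status(get_role())
            payload = body.encode()
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return HealthHandler
```

A handler built with `make_handler(lambda: "PRIM")` could be passed to `http.server.HTTPServer` on whatever port the load balancer probe targets. After a failover, the same endpoint on the other server would start answering OK, which is what moves the traffic.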

How does the SafeKit mirror cluster work?

Step 1. Real-time replication

Server 1 (PRIM) runs the application. Clients are connected to a virtual IP address. SafeKit replicates the modifications made inside files over the network in real time.

File replication at byte level in a mirror cluster

Replication is synchronous, so no data is lost on failure, unlike with asynchronous replication.

You just have to configure the names of the directories to replicate in SafeKit. There are no prerequisites on disk organization. Directories may be located on the system disk.
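
To make the idea of synchronous, byte-level replication concrete, here is a small in-memory sketch. It is an illustration under assumptions, not SafeKit code: `Replica` and `replicated_write` are hypothetical names, and dictionaries stand in for the replicated directories. The point it shows is that a write returns only after the secondary has acknowledged the same bytes, and that only the modified byte range travels.

```python
class Replica:
    """In-memory stand-in for one server's copy of the replicated files."""

    def __init__(self):
        self.files = {}  # path -> bytearray

    def apply(self, path, offset, data):
        """Write data at offset, growing the file with zeros if needed."""
        buf = self.files.setdefault(path, bytearray())
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(b"\x00" * (end - len(buf)))
        buf[offset:end] = data
        return True  # acknowledgement


def replicated_write(primary, secondary, path, offset, data):
    """Synchronous write: succeed only once both copies hold the bytes."""
    primary.apply(path, offset, data)
    # Only the changed byte range is sent, not the whole file.
    acked = secondary.apply(path, offset, data)
    if not acked:
        raise IOError("secondary did not acknowledge the write")
    return len(data)


prim, second = Replica(), Replica()
replicated_write(prim, second, "/data/db", 0, b"hello")
replicated_write(prim, second, "/data/db", 1, b"a")  # one-byte modification
```

Because the caller is not told the write succeeded until the secondary has it, a failure of the primary right after the call cannot lose that write. An asynchronous scheme would acknowledge first and ship the bytes later.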

Step 2. Automatic failover

When Server 1 fails, Server 2 takes over. SafeKit switches the virtual IP address and restarts the application automatically on Server 2.

The application finds the files replicated by SafeKit up to date on Server 2. The application continues to run on Server 2, locally modifying its files, which are no longer replicated to Server 1.

Failover in a mirror cluster

The failover time is equal to the fault-detection time (30 seconds by default) plus the application start-up time.
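
The failover time formula can be written out as a short sketch. The 30-second default comes from the text above. The heartbeat-style detection function and the 45-second start-up time used in the example are assumptions for illustration only.

```python
DEFAULT_DETECTION_S = 30  # default fault-detection time quoted above


def peer_has_failed(last_heartbeat_s, now_s, detection_s=DEFAULT_DETECTION_S):
    """Assumed detection rule: no sign of life for the whole detection window."""
    return now_s - last_heartbeat_s >= detection_s


def failover_time_s(app_startup_s, detection_s=DEFAULT_DETECTION_S):
    """Failover time = fault-detection time + application start-up time."""
    return detection_s + app_startup_s


# Hypothetical application that needs 45 seconds to start:
total = failover_time_s(45)  # 30 + 45 = 75 seconds
```

With the default detection time, the application start-up time is the only part of the failover that depends on the workload.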

Step 3. Automatic failback

Failback involves restarting Server 1 after fixing the problem that caused it to fail.

SafeKit automatically resynchronizes the files, updating only the files modified on Server 2 while Server 1 was halted.

Failback in a mirror cluster

Failback takes place without disturbing the application, which can continue running on Server 2.
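
The failback step can be sketched as follows. This is a simplified model rather than SafeKit's mechanism: it assumes the surviving server keeps a set of paths modified while its peer was down, and `resynchronize` is a hypothetical helper that copies only those paths.

```python
def resynchronize(source, target, modified):
    """Bring target up to date by copying only the paths in `modified`.

    source and target map path -> file content; modified is the set of
    paths changed on the source while the target was halted.
    """
    copied = []
    for path in sorted(modified):
        if path in source:
            target[path] = source[path]
        else:
            target.pop(path, None)  # file was deleted on the surviving server
        copied.append(path)
    modified.clear()  # both copies are identical again
    return copied


# Server 2 kept running while Server 1 was halted.
server2 = {"a": b"1", "b": b"2-new", "c": b"3"}
server1 = {"a": b"1", "b": b"2", "d": b"old"}
changed_on_server2 = {"b", "c", "d"}

copied = resynchronize(server2, server1, changed_on_server2)
```

File "a" is never touched, which is why the resynchronization cost depends on how much changed during the outage rather than on the total amount of replicated data.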

Step 4. Back to normal

After reintegration, the files are once again in mirror mode, as in step 1. The system is back in high-availability mode, with the application running on Server 2 and SafeKit replicating file updates to Server 1.

Return to normal operation in a mirror cluster

If the administrator wants the application to run on Server 1, they can execute a "swap" command, either manually at an appropriate time or automatically through configuration.

Typical usage with SafeKit

Why replicate only a few terabytes?

Resynchronization time after a failure (step 3)

  • 1 Gb/s network ≈ 3 hours for 1 terabyte.
  • 10 Gb/s network ≈ 1 hour for 1 terabyte, or less depending on disk write performance.
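
A rough back-of-the-envelope calculation shows where figures of this order come from. The sketch below is an estimate under assumptions: 1 terabyte is taken as 10^12 bytes, protocol overhead is ignored, and the 300 MB/s disk write rate is a hypothetical value used only to show how the disk can become the limit on a fast network.

```python
def resync_hours(size_bytes, link_bits_per_s, disk_bytes_per_s=None):
    """Estimate a full resynchronization time in hours.

    The effective rate is the slower of the network link and, if given,
    the disk write throughput on the receiving server.
    """
    rate = link_bits_per_s / 8  # bytes per second on the wire
    if disk_bytes_per_s is not None:
        rate = min(rate, disk_bytes_per_s)
    return size_bytes / rate / 3600


ONE_TB = 1e12

# 1 Gb/s at full line rate: about 2.2 hours, so roughly 3 hours with overhead.
one_gig = resync_hours(ONE_TB, 1e9)

# 10 Gb/s with an assumed 300 MB/s disk: the disk limits it to about 0.93 hours.
ten_gig = resync_hours(ONE_TB, 10e9, disk_bytes_per_s=300e6)
```

On the 10 Gb/s link the network alone would need only about 13 minutes, so the disk write rate determines the result.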

Alternative

Why replicate fewer than 1,000,000 files?

  • Resynchronization time performance after a failure (step 3).
  • Time to check each file between both nodes.

Alternative

  • Put the many files to replicate in a virtual hard disk / virtual machine.
  • Only the files representing the virtual hard disk / virtual machine will be replicated and resynchronized in this case.

Why fail over at most 32 replicated VMs?

  • Each VM runs in an independent mirror module.
  • Maximum of 32 mirror modules running on the same cluster.

Alternative

  • Use an external shared storage and another VM clustering solution.
  • More expensive, more complex.

Why a LAN/VLAN network between remote sites?

Alternative

  • Use a load balancer for the virtual IP address if the 2 nodes are in 2 subnets (supported by SafeKit, especially in the cloud).
  • Use backup solutions with asynchronous replication for high-latency networks.

SafeKit High Availability Differentiators