Kubernetes K3S: the simplest high availability cluster between two redundant servers

The solution for Kubernetes K3S

Evidian SafeKit brings high availability to Kubernetes K3S between two redundant servers. This article explains how to implement quickly a Kubernetes cluster on 2 nodes without NFS external storage, without an external configuration database and without specific skills.

Note that SafeKit is a generic product. You can implement with the same product real-time replication and failover of directories and services, databases, Docker, Podman, full Hyper-V or KVM virtual machines, Cloud applications (see the module list).

This clustering solution is recognized as the simplest to implement by our customers and partners. The SafeKit solution is the perfect solution for running Kubernetes applications on premise and on 2 nodes.

We have chosen K3S as the Kubernetes engine because it is a lightweight solution for IoT & Edge computing.

The k3s.safe mirror module implements:

2 active K3S masters/agents running pods
replication of the K3S configuration database (MariaDB)
replication of persistent volumes (implemented by NFS client dynamic provisionner storage class: nfs-client)
virtual IP address, automatic failover, automatic failback

How it works?

The following table explains how the solution is working on 2 nodes. Other nodes with K3S agents (without SafeKit) can be added for horizontal scalability.

Kubernetes K3S components
SafeKit PRIM node	SafeKit SECOND node
K3S (master and agent) is running pods on the primary node	K3S (master and agent) is running pods on the secondary node
NFS Server is running on the primary node with: a virtual IP/NFS port exported NFS share K3S persistent volumes	Persistent volumes are replicated synchronously and in real-time by SafeKit on the secondary node
MariaDB server is running on the primary node with: a virtual IP/MariaDB port K3S configuration database	The configuration database is replicated synchronously and in real-time by SafeKit on the secondary node

A simple solution

SafeKit is the simplest high availabiliy solution for running Kubernetes applications on 2 nodes and on premise.

SafeKit	Benefits
Synchronous real-time replication for persistent volumes	No external NAS/NFS storage for persistent volumes
Only 2 nodes for HA of Kubernetes	No need for 3 nodes like with etcd database
Same simple product for virtual IP address, replication, failover, failback, administration, maintenance	Avoid different technologies for virtual IP (metal-lb, BGP), HA of persistent volumes, HA of configuration database
Supports disaster recovery with two remote nodes	Avoid replicated NAS storage

Step 1. File replication at byte level in a mirror cluster

This step corresponds to the following figure. Server 1 (PRIM) runs the Kubernetes K3S components explained in the previous table. Clients are connected to the virtual IP address of the mirror cluster. SafeKit replicates in real time files opened by the Kubernetes K3S components. Only changes made by the components in the files are replicated across the network, thus limiting traffic (byte-level file replication).

With a software data replication at the file level, only names of directories are configured in SafeKit. There are no pre-requisites on disk organization for the two servers. Directories to replicate may be located in the system disk. SafeKit implements synchronous replication with no data loss on failure contrary to asynchronous replication.

Step 2. Failover

When Server 1 fails, Server 2 takes over. SafeKit switches the cluster's virtual IP address and restarts the Kubernetes K3S components automatically on Server 2. The components find the files replicated by SafeKit uptodate on Server 2, thanks to the synchronous replication between Server 1 and Server 2. The components continue to run on Server 2 by locally modifying their files that are no longer replicated to Server 1.

The failover time is equal to the fault-detection time (set to 30 seconds by default) plus the components start-up time. Unlike disk replication solutions, there is no delay for remounting file system and running file system recovery procedures.

Step 3. Failback and reintegration

Failback involves restarting Server 1 after fixing the problem that caused it to fail. SafeKit automatically resynchronizes the files, updating only the files modified on Server 2 while Server 1 was halted. This reintegration takes place without disturbing the Kubernetes K3S components, which can continue running on Server 2.

If SafeKit was cleanly stopped on server 1, then at its restart, only the modified zones inside files are resynchronized, according to modification tracking bitmaps.

If server 1 crashed (power off), the modification bitmaps are not reliable and not used. All the files bearing a modification timestamp more recent than the last known synchronization point are resynchronized.

Step 4. Return to byte-level file replication in the mirror cluster

After reintegration, the files are once again in mirror mode, as in step 1. The system is back in high-availability mode, with the Kubernetes K3S components running on Server 2 and SafeKit replicating file updates to the secondary Server 1.

If the administrator wishes the Kubernetes K3S components to run on Server 1, he/she can execute a "swap" command either manually at an appropriate time, or automatically through configuration.

SafeKit quick installation guide for Kubernetes K3S

2. First on both nodes

On 2 Linux Ubuntu 20.04 nodes, as root:

Make sure the node has internet access (could be through a proxy)
Copy k3sconfig.sh, k3s.safe and the safekit_xx.bin package into a directory and cd into it
Rename the .bin file as "safekit.bin"
Make sure k3sconfig.sh and safekit.bin are executable.
Edit the k3sconfig.sh script and customize the environment variables according to your environment (including a virtual IP)
Execute on both nodes: ./k3sconfig.sh prereq

The script will:

Install required debian packages: alien, nfs-kernel-server, nfs-common, mariadb-server
Secure mariadb installation
Create directories for file replication
Prepare the NFS server for sharing replicated directories
Install SafeKit

3. On the first node

Execute on the first node: ./k3sconfig.sh first

The script will:

Create the K3S configuration database and the k3s user
Create the replicated storage volume file (sparse file) and format it as an xfs filesystem
Create the safekit cluster configuration and apply it
Install and configure the k3s.safe module on the cluster
Start the k3s module as "prim" on the first node
Download, install and start k3s
Download and install nfs-subdir-external-provisioner Helm chart
Display K3S token (to be used during second node installation phase)


/opt/safekit/safekit –H "*" state
---------------- Server=http://10.0.0.20:9010 ----------------
admin action=exec
--------------------- k3s State ---------------------

  Local  (127.0.0.1)    : PRIM (Service : Available)(Color : Green)
Success
---------------- Server=http://10.0.0.21:9010 ----------------
admin action=exec
--------------------- k3s State ---------------------

  Local  (127.0.0.1)    : SECOND (Service : Available)(Color : Green)
Success

7. Testing

Stop the PRIM node by scrolling down its contextual menu and clicking Stop.
Verify that there is a failover on the SECOND node which should become ALONE (green).
And with command lines on Linux, check the failover of services (stopped on node 1 in the stop_prim script and started on node 2 in the start_prim script). MariaDB and K3S should run on node2.

If ALONE (green) is not reached on node2, analyze why with the module log of node 2.

click on node2 to display the module log.
example of a SQL Server module log where the service name in start_prim is invalid. The sqlserver.exe process is monitored but as it is not started, at the end the module stops.

If everything is okay, initiate a start on node1, which will resynchronize the replicated directories from node2.

If things go wrong, stop node2 and force the start as primary of node1, which will restart with its locally healthy data at the time of the stop.

More information on tests in the User's Guide.

8. Try the cluster with a Kubernetes application like WordPress

You have the example of a WordPress installation in the image: a web portal with a backend database implemented by pods.

You can deploy your own application in the same way.

WordPress is automatically highly available:

with its data (php + database) in persistent volumes replicated in real-time by SafeKit
with a virtual IP address to access the WordPress site for users
with automatic failover and automatic failback

Notes:

The WordPress chart defines a load balanced service that listens on <service.port> and <service.httpsport> ports.
WordPress is accessible through the url: http://<virtual-ip>:<service.port>.
The virtual IP is managed by SafeKit and automatically switched in case of failover.
By default, K3S implements load balancers with Klipper.
Klipper listens on <virtual ip>:<service.port> and routes the TCP/IP packets to the IP address and port of the WordPress pod that it has selected.

$ export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm install my-release bitnami/wordpress --set global.storageClass=nfs-client --set service.port=8099,service.httpsPort=4439

10. If necessary, configure a splitbrain checker

See below "What are the different scenarios in case of network isolation in a cluster?" to know if you need to configure a splitbrain checker.
In the module configuration, click on Advanced Configuration (see image) to edit userconfig.xml.

Declare the splitbrain checker by adding in the <check> section of userconfig.xml:

<service>
 ...
 <check>
  ...
  <splitbrain ident="witness" exec="ping" arg="witness IP"/>
 </check>

Save and apply the new configuration to redeploy the modified userconfig.xml file on both nodes (module must be stopped on both nodes to save and apply).

Parameters:

ident="witness" identifies the witness with a resource name: splitbrain.witness. You can change this value to identify the witness.
exec="ping" references the ping code to execute. Do not change this value.
arg="witness IP" is an argument for the ping. Change this value with the IP of the witness (a robust element, typically a router).

When there is a network isolation, the behavior with a split-brain checker is:

a split-brain checker has been configured with the IP address of a witness (typically a router),
the split-brain checker operates when a server goes from PRIM to ALONE or from SECOND to ALONE,
in case of network isolation, before going to ALONE, both nodes test the IP address,
the node which can access the IP address goes to ALONE, the other one goes to WAIT,
when the isolation is repaired, the WAIT node resynchronizes its data and becomes SECOND.

Note: If the witness is down or disconnected, both nodes go to WAIT and the application is no more running. That's why you must choose a robust witness like a router.

Advanced clustering architectures

Several modules can be deployed on the same cluster. Thus, advanced clustering architectures can be implemented:

the farm+mirror cluster built by deploying a farm module and a mirror module on the same cluster,
the active/active cluster with replication built by deploying several mirror modules on 2 servers,
the Hyper-V cluster or KVM cluster with real-time replication and failover of full virtual machines between 2 active hypervisors,
the N-1 cluster built by deploying N mirror modules on N+1 servers.

Key differentiators of a mirror cluster with replication and failover

Evidian SafeKit mirror cluster with real-time file replication and failover
3 products in 1 More info >	The SafeKit high availability software saves on Windows and Linux the cost of : external shared or replicated storage, load balancing boxes, enterprise editions of OS and databases SafeKit includes all clustering features: synchronous real-time file replication, monitoring of server / network / software failures, automatic application restart, virtual IP address switched in case of failure to reroute clients
Very simple configuration More info >	The cluster configuration is very simple and made by means of application modules. New services and new replicated directories can be added to an existing application module to complete a high availability solution All the configuration of clusters is made using a simple centralized web administration console There is no domain controller or active directory to configure as with Microsoft cluster
Synchronous replication More info >	The real-time replication is synchronous with no data loss on failure This is not the case with asynchronous replication
Fully automated failback More info >	After a failure when a server reboots, the replication failback procedure is fully automatic and the failed server reintegrates the cluster without stopping the application on the only remaining server This is not the case with most replication solutions particularly with replication at the database level. Manual operations are required for resynchronizing a failed server. The application may even be stopped on the only remaining server during the resynchonization of the failed server
Replication of any type of data More info >	The replication is working for databases but also for any files which shall be replicated This not the case for replication at the database level
File replication vs disk replication More info >	The replication is based on file directories that can be located anywhere (even in the system disk) This is not the case with disk replication where special application configuration must be made to put the application data in a special disk
File replication vs shared disk More info >	The servers can be put in two remote sites This is not the case with shared disk solutions
Remote sites and virtual IP address More info >	All SafeKit clustering features are working for 2 servers in remote sites. Replication requires an extended LAN type network (latency = performance of synchronous replication, bandwidth = performance of resynchronization after failure). If both servers are connected to the same IP network through an extended LAN between two remote sites, the virtual IP address of SafeKit is working with rerouting at level 2 If both servers are connected to two different IP networks between two remote sites, the virtual IP address can be configured at the level of a load balancer with the "healh check" of SafeKit.
Quorum and split brain More info >	The solution works with only 2 servers and for the quorum (network isolation between both sites), a simple split brain checker to a router is offered to support a single execution of the critical application This is not the case for most clustering solutions where a 3^rd server is required for the quorum
Active/active cluster More info >	The secondary server is not dedicated to the restart of the primary server. The cluster can be active-active by running 2 different mirror modules This is not the case with a fault-tolerant system where the secondary is dedicated to the execution of the same application synchronized at the instruction level
Uniform high availability solution More info >	SafeKit implements a mirror cluster with replication and failover. But it imlements also a farm cluster with load balancing and failover. Thus a N-tiers architecture can be made highly available and load balanced with the same solution on Windows and Linux (same installation, configuration, administration with the SafeKit console or with the command line interface). This is unique on the market This is not the case with an architecture mixing different technologies for load balancing, replication and failover
RTO / RPO More info >	SafeKit implements quick application restart in case of failure: around 1 mn or less Quick application restart is not ensured with full virtual machines replication. In case of hypervisor failure, a full VM must be rebooted on a new hypervisor with a recovery time depending on the OS reboot as with VMware HA or Hyper-V cluster

Key differentiators of a farm cluster with load balancing and failover

Evidian SafeKit farm cluster with load balancing and failover
No load balancer or dedicated proxy servers or special multicast Ethernet address More info >	The solution does not require load balancers or dedicated proxy servers above the farm for imlementing load balancing. SafeKit is installed directly on the application servers in the farm. The load balancing is based on a standard virtual IP address/Ethernet MAC address and is working with physical servers or virtual machines on Windows and Linux without special network configuration This is not the case with network load balancers This is not the case with dedicated proxies on Linux This is not the case with a specific multicast Ethernet address on Windows
All clustering features More info >	The solution includes all clustering features: virtual IP address, load balancing on client IP address or on sessions, monitoring of server / network / software failures, automatic application restart with a quick revovery time and a replication option with a mirror module This is not the case with other load balancing solutions. They are able to make load balancing but they do not include a full clustering solution with restart scripts and automatic application restart in case of failure. They do not offer a replication option The cluster configuration is very simple and made by means of application modules. There is no domain controller or active directory to configure on Windows. The solution works on Windows and Linux
Remote sites and virtual IP address More info >	If servers are connected to the same IP network through an extended LAN between remote sites, the virtual IP address of SafeKit is working with load balancing at level 2 If servers are connected to different IP networks between remote sites, the virtual IP address can be configured at the level of a load balancer with the help of the SafeKit health check. Thus you can implement load balancing but also all the clustering features of SafeKit, in particular monitoring and automatic recovery of the critical application on application servers
Uniform high availability solution More info >	SafeKit imlements a farm cluster with load balancing and failover. But it implements also a mirror cluster with replication and failover. Thus a N-tiers architecture can be made highly available and load balanced with the same solution on Windows and Linux (same installation, configuration, administration with the SafeKit console or with the command line interface). This is unique on the market This is not the case with an architecture mixing different technologies for load balancing, replication and failover

Key differentiators of the SafeKit high availability technology

Software clustering vs hardware clustering More info >
A simple software cluster with the SafeKit package just installed on two servers	Complex hardware clustering with external storage or network load balancers
Shared nothing vs a shared disk cluster More info >
SafeKit is a shared-nothing cluster: easy to deploy even in remote sites	A shared disk cluster is complex to deploy
Application High Availability vs Full Virtual Machine High Availability More info >
Application HA supports hardware failure and software failure with a quick recovery time (RTO around 1 mn or less). Application HA requires to define restart scripts per application and folders to replicate (SafeKit application modules).	Full virtual machines HA supports only hardware failure with a VM reboot and a recovery time depending on the OS reboot. No restart scripts to define with full virtual machines HA (SafeKit hyperv.safe or kvm.safe modules). Hypervisors are active/active with just multiple virtual machines.
High availability vs fault tolerance More info >
No dedicated server with SafeKit. Each server can be the failover server of the other one. Software failure with restart in another OS environment. Smooth upgrade of application and OS possible server by server (version N and N+1 can coexist)	Secondary server dedicated to the execution of the same application synchronized at the instruction level. Software exception on both servers at the same time. Smooth upgrade not possible
Synchronous replication vs asynchronous replication More info >
SafeKit implements real-time synchronous replication with no data loss in case of failure	With asynchronous replication, there is data loss on failure
Byte-level file replication vs block-level disk replication More info >
SafeKit implements real-time byte-level file replication and is simply configured with application directories to replicate even in the system disk	Block-level disk replication is complex to configure and requires to put application data in a special disk
Heartbeat, failover and quorum to avoid 2 master nodes More info >
To avoid 2 masters, SafeKit proposes a simple split brain checker configured on a router	To avoid 2 masters, other clusters require a complex configuration with a third machine, a special quorum disk, a special interconnect
Virtual IP address primary/secondary, network load balancing, failover More info >
No dedicated proxy servers and no special network configuration are required in a SafeKit cluster for virtual IP addresses	Special network configuration is required in other clusters for virtual IP addresses. Note that SafeKit offers a health check adapted to load balancers

Advanced configuration

Mirror module / pptx
- start_prim / stop_prim scripts
- userconfig.xml
- Heartbeat (<hearbeat>)
- Virtual IP address (<vip>)
- Real-time file replication (<rfs>)
- How real-time file replication works?
- Mirror's states in action
Farm module / pptx
- start_both / stop_both scripts
- userconfig.xml
- Farm heartbeats (<farm>)
- Virtual IP address (<vip>)
- Farm's states in action

Checkers / pptx
- userconfig.xml
- errd checker
- intf and ip checkers
- custom checker
- splitbrain checker for a mirror module
- tcp, ping, module checkers
- Checkers in action

Network load balancing and failover
Windows farm	Linux farm
Generic Windows farm >	Generic Linux farm >
Microsoft IIS >	-
NGINX >
Apache >
Amazon AWS farm >
Microsoft Azure farm >
Google GCP farm >
Other cloud >

Real-time file replication and failover
Windows mirror	Linux mirror
Generic Windows mirror >	Generic Linux mirror >
Microsoft SQL Server >	-
Oracle >
MariaDB >
MySQL >
PostgreSQL >
Firebird >
Windows Hyper-V >	Linux KVM >
-	Docker > Podman > Kubernetes K3S >
-	Elasticsearch >
Milestone XProtect >	-
Genetec SQL Server >	-
Hanwha Vision > Hanwha Wisenet >	-
Nedap AEOS >	-
Siemens SIMATIC WinCC > Siemens SIMATIC PCS 7 > Siemens Siveillance suite > Siemens Siveillance VMS > Siemens Desigo CC > Siemens SiPass > Siemens SIPORT >	-
Bosch AMS > Bosch BIS > Bosch BVMS >	-
Amazon AWS mirror >
Microsoft Azure mirror >
Google GCP mirror >
Other cloud >

Kubernetes K3S: the simplest high availability cluster between two redundant servers

With the synchronous replication and automatic failover provided by Evidian SafeKit

How the Evidian SafeKit software simply implements a Kubernetes K3S high availability cluster between two redundant servers?

The solution for Kubernetes K3S

How it works?

A simple solution

Partners, the success with SafeKit

How the SafeKit mirror cluster works with Kubernetes K3S?

Step 1. File replication at byte level in a mirror cluster

Step 2. Failover

Step 3. Failback and reintegration

Step 4. Return to byte-level file replication in the mirror cluster

SafeKit free trial + mirror module for Kubernetes K3S + quick installation guide

1. Download packages

2. First on both nodes

3. On the first node

4. On the second node

5. Check that the k3s SafeKit module is running on both nodes

6. Start the SafeKit web console to administer the cluster

7. Testing

8. Try the cluster with a Kubernetes application like WordPress

9. Support

10. If necessary, configure a splitbrain checker

What are the different scenarios in case of network isolation in a cluster?

A single network

Two networks with a dedicated replication network

A single network and a splitbrain checker

Typical usage with SafeKit

Why a replication of a few Tera-bytes?

Why a replication < 1,000,000 files?

Why a failover ≤ 32 replicated VMs?

Why a LAN/VLAN network between remote sites?

SafeKit Modules for Plug&Play Redundancy and High Availability Solutions

Network load balancing and failover

Advanced clustering architectures

Real-time file replication and failover

Evidian SafeKit Webinar

SafeKit Customers in all Business Activities

Best use cases [+]

Video management, access control, building management [+]

TV broadcasting [+]

Finance [+]

Industry [+]

Air traffic control [+]

Bank [+]

Transport [+]

Healthcare [+]

Government [+]

SafeKit High Availability Differentiators against Competition

Evidian SafeKit mirror cluster with real-time file replication and failover

Evidian SafeKit farm cluster with load balancing and failover

Evidian SafeKit 8.2

All new features compared to SafeKit 7.5 described in the release notes

Packages

One-month license key

Technical documentation

Training

Product information

Modules and quick installation

SafeKit 8.2 Training

Introduction

Installation, Console, CLI

Advanced configuration

Troubleshooting

Support