Synchronous replication vs asynchronous replication

Data loss or not on application failover?
There is a significant difference between synchronous and asynchronous replication. Depending on which one you choose, you may lose data on application failover.

Synchronous replication, as implemented by the SafeKit software, is essential for failover of transactional applications. With synchronous replication, all data committed on the disk of the first server is also on the disk of the second server. With asynchronous replication, data committed on the disk of the first server can be lost in case of failure. There is also an alternative named semi-synchronous replication, in which committed data is on the second server but not necessarily on its disk.
To help you make the right decision when choosing between synchronous and asynchronous replication, we now explain the technical mechanisms and their impact on application failover.
Synchronous replication requires the bandwidth of a LAN between the servers, possibly an extended LAN spanning two geographically remote computer rooms. Asynchronous replication can be implemented over a low-speed WAN.
Synchronous replication
With synchronous file-based replication as implemented by the SafeKit high availability software, when a disk IO is performed inside a replicated file on the primary server, by the application or by the file system cache, SafeKit waits for the IO acknowledgement from the local disk and from the secondary server before sending the IO acknowledgement to the application or to the file system cache. This mechanism is essential for failover of transactional applications. Note that SafeKit performs byte-level file replication by replicating directories and not entire disks, which greatly simplifies the configuration of a cluster.
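The acknowledgement order described above can be sketched in a few lines of Python. This is only an illustrative model, not SafeKit code: the `Disk`, `Secondary` and `SyncPrimary` classes, the in-memory dictionary standing in for a disk, and the file name are all assumptions made for the example.

```python
class Disk:
    """In-memory stand-in for a disk: maps (path, offset) to bytes."""

    def __init__(self):
        self.blocks = {}

    def write(self, path, offset, data):
        self.blocks[(path, offset)] = data
        return True  # acknowledgement from the disk


class Secondary:
    """Hypothetical secondary server in synchronous mode."""

    def __init__(self):
        self.disk = Disk()

    def replicate(self, path, offset, data):
        # Write to disk first, then acknowledge to the primary.
        return self.disk.write(path, offset, data)


class SyncPrimary:
    """Hypothetical primary server: acknowledges only after both writes."""

    def __init__(self, secondary):
        self.disk = Disk()
        self.secondary = secondary

    def write(self, path, offset, data):
        local_ack = self.disk.write(path, offset, data)
        remote_ack = self.secondary.replicate(path, offset, data)
        # The application sees the acknowledgement only when both are in.
        return local_ack and remote_ack


secondary = Secondary()
primary = SyncPrimary(secondary)
acked = primary.write("db/journal.log", 0, b"COMMIT tx1")
```

In this model, whenever `write` returns to the application, the same bytes are already on the secondary's disk, which is why a committed transaction survives a failover.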
Asynchronous replication
With asynchronous file-based replication, as implemented by most solutions, the IOs are placed in a queue on the primary server, but the primary server does not wait for the IO acknowledgements of the secondary server. So, all data that did not have time to be copied across the network to the second server is lost if the first server fails. In particular, a transactional application loses committed data in case of failure.
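To make the data loss concrete, here is a minimal Python sketch of a queue-based asynchronous primary. It is a simplified model under stated assumptions, not the behaviour of any particular product: `AsyncPrimary`, `ship_one` and the transaction keys are hypothetical names, and dictionaries stand in for disks.

```python
from collections import deque


class AsyncPrimary:
    """Hypothetical primary that queues IOs for later shipping."""

    def __init__(self):
        self.disk = {}
        self.queue = deque()  # IOs waiting to be sent to the secondary

    def write(self, key, data):
        self.disk[key] = data
        self.queue.append((key, data))
        return True  # acknowledged without waiting for the secondary

    def ship_one(self, secondary_disk):
        """Send the oldest queued IO to the secondary, if there is one."""
        if self.queue:
            key, data = self.queue.popleft()
            secondary_disk[key] = data


secondary_disk = {}
primary = AsyncPrimary()
primary.write("tx1", b"COMMIT tx1")
primary.write("tx2", b"COMMIT tx2")
primary.ship_one(secondary_disk)  # only tx1 crosses the network

# The primary fails here: whatever is still queued never reaches the secondary.
lost = [key for key, _ in primary.queue]
```

Both transactions were acknowledged to the application, yet after the failover the secondary only holds `tx1`; `tx2` is the committed data that is lost.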
Semi-synchronous replication
With semi-synchronous file-based replication as implemented by the SafeKit high availability software, the asynchrony is not on the primary server but on the secondary one. In this solution, SafeKit always waits for the acknowledgement of both servers before sending the acknowledgement to the application or the file system cache. But on the secondary, there are two options: asynchronous or synchronous.
In the asynchronous case, which is the semi-synchronous mode, the secondary sends the acknowledgement to the primary upon receipt of the IO and writes to disk afterwards. In the synchronous case, the secondary writes the IO to disk and then sends the acknowledgement to the primary.
Be careful: the synchronous mode on the secondary server is required to cover a simultaneous power outage of both servers in which the former primary server cannot be restarted and the application has to restart on the secondary.
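The difference between the two secondary modes, and why it matters in the double power outage scenario, can be sketched as follows. This is an illustrative Python model only: the `Secondary` class, the `ack_mode` values and the `power_outage` helper are hypothetical and do not correspond to SafeKit configuration names.

```python
class Secondary:
    """Hypothetical secondary with a selectable acknowledgement mode."""

    def __init__(self, ack_mode):
        if ack_mode not in ("sync", "semi-sync"):
            raise ValueError(ack_mode)
        self.ack_mode = ack_mode
        self.memory = []  # IOs received but not yet written to disk
        self.disk = {}

    def receive(self, key, data):
        if self.ack_mode == "sync":
            self.disk[key] = data  # write to disk first, then acknowledge
        else:
            self.memory.append((key, data))  # acknowledge upon receipt
        return True

    def flush(self):
        """Write the buffered IOs to disk (the delayed write)."""
        for key, data in self.memory:
            self.disk[key] = data
        self.memory.clear()

    def power_outage(self):
        self.memory.clear()  # anything not yet on disk is gone


def surviving_after_double_outage(ack_mode):
    secondary = Secondary(ack_mode)
    secondary.receive("tx1", b"COMMIT tx1")  # the primary gets its ack here
    secondary.power_outage()  # both servers lose power before any flush
    return sorted(secondary.disk)
```

Calling `surviving_after_double_outage("sync")` returns `["tx1"]`, while `surviving_after_double_outage("semi-sync")` returns an empty list: in the semi-synchronous mode the acknowledged IO was still in memory when the power went out.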
Conclusion
As you can see, merely delaying the write on the secondary server has a direct impact on the failover of a critical application. So be very careful when choosing between synchronous and asynchronous replication. Always prefer synchronous or semi-synchronous replication for a critical application.
FAQ on Evidian SafeKit (synchronous replication)
Customers
Transport
The Paris transport company (RATP) chose the SafeKit high availability and load balancing solution for the centralized control room of line 1 of the Paris subway.
20 SafeKit clusters are deployed on Windows and Linux.
Stéphane Guilmin, Project Manager at RATP, says:
“Automation of line 1 of the Paris subway is a major project for RATP, requiring a centralized command room (CCR) designed to resist IT failures. With SafeKit, we have three distinct advantages to meet this need. Firstly, SafeKit is a purely software solution that does not demand the use of shared disks on a SAN and network boxes for load balancing. It is very simple to separate our servers into separate machine rooms. Moreover, this clustering solution is homogeneous for our Windows and Unix platforms. SafeKit provides the three functions that we needed: load balancing between servers, automatic failover after an incident and real time data replication.”
Philippe Marsol, Integration Manager at Atos BU Transport, says:
“SafeKit is a simple and powerful product for application high availability. We have integrated SafeKit in our critical projects like the supervision of Paris metro Line 4 (the control room) or Marseille Line 1 and Line 2 (the operations center). Thanks to the simplicity of the product, we gained time for the integration and validation of the solution and we had also quick answers to our questions with a responsive Evidian team.”
More on the mirror cluster
What are the advantages
- Low Complexity
- Plug&Play deployment with no specific skills
- Suitable for large deployments in many sites (very simple to deploy)
- 2 nodes (replication across 3 nodes is possible)
- No shared storage requirement
- No Domain Controller requirement
- Same solution on Windows and Linux
- Supports Windows Server and Client OS editions
- Well documented API and support
- Synchronous data replication (no data loss in case of failure)
- Replicated directories can be in the system disk
- Supports multiple heartbeats and virtual IP addresses
- Offers configurable software, hardware and network checkers
- For the quorum, does not require a special disk or a third machine or a dedicated link between both servers
- Automatic failover of services with a recovery time in the order of one minute
- Automatic failback when a server comes back after a failure (no manual operation)
- A very simple console to deploy the solution and to maintain it afterwards for the end customer
- Supports human errors (40% of causes of unavailability) thanks to its simplicity
- Supports software failures (40% of causes of unavailability): regression on software update (version N and N+1 can coexist), Operating System frozen, software bug
- Supports hardware and environment failures (20% of causes of unavailability), including the complete failure of a computer room with 2 nodes in two remote sites
What is the recovery time (RTO)
RTO is the time during which the application is unavailable in case of failure. The RTO of the SafeKit mirror solution is in the order of 1 minute.
For a hardware failure, RTO = heartbeat timeout (default 30 s, can be changed in userconfig.xml) + time to restart services.
For a software failure or an administrator restart, RTO = time to (cleanly) stop services + time to restart them.
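As a worked example of the two formulas above, the following Python sketch computes both RTO values. Only the 30 s default heartbeat timeout comes from the text; the function names and the stop and restart durations are hypothetical figures chosen for illustration, since the real values depend on the application.

```python
def rto_hardware_failure(restart_s, heartbeat_timeout_s=30.0):
    """RTO = heartbeat timeout + time to restart services (seconds)."""
    return heartbeat_timeout_s + restart_s


def rto_software_failure(stop_s, restart_s):
    """RTO = time to cleanly stop services + time to restart them (seconds)."""
    return stop_s + restart_s


# Assumed durations: 10 s to stop the application, 25 s to restart it.
hw = rto_hardware_failure(restart_s=25.0)              # 30 + 25 = 55 s
sw = rto_software_failure(stop_s=10.0, restart_s=25.0)  # 10 + 25 = 35 s
```

With these assumed durations both cases stay within the one-minute order of magnitude quoted for the mirror solution; a shorter heartbeat timeout would reduce the hardware-failure figure accordingly.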
Be careful: with solutions that reboot a full virtual machine after a failure, the RTO is unpredictable, because manual operations may be required to reboot the virtual machine after a hardware crash.
What is the data loss (RPO)
RPO reflects the data loss in case of failure. The RPO of the SafeKit mirror solution is 0, as the replication is synchronous and real-time.
Be careful: with asynchronous replication, the RPO is not 0, and data is lost when the application restarts on the secondary server after a failure.
More on the farm cluster
What are the advantages
- Low Complexity
- Plug&Play deployment with no specific skills
- Suitable for large deployments in many sites (very simple to deploy)
- 2 nodes or more
- No network load balancers requirement
- No proxy server requirement (above the farm cluster)
- No Domain Controller requirement
- No restriction in VMware due to multicast or unicast address
- Same solution on Windows and Linux
- Supports Windows Server and Client OS editions
- Well documented API and support
- Supports multiple monitoring channels on multiple networks for server failure detection
- Supports multiple virtual IP addresses
- Offers configurable software, hardware and network checkers
- Offers the mirror cluster with synchronous real-time replication and failover
- Automatic failover with a recovery time in the order of a few seconds
- Automatic failback when a server comes back after a failure (no manual operation)
- A very simple console to deploy the solution and to maintain it afterwards for the end customer
- Supports human errors (40% of causes of unavailability) thanks to its simplicity
- Supports software failures (40% of causes of unavailability): regression on software update (version N and N+1 can coexist), Operating System frozen, software bug
- Supports hardware and environment failures (20% of causes of unavailability), including the complete failure of a computer room with 2 nodes in two remote sites
What is the recovery time (RTO)
RTO is the time during which the application is unavailable in case of failure. RTO of the SafeKit farm solution is in the order of a few seconds on hardware failure.
For a hardware failure, RTO = failure detection timeout through monitoring channels (default a few seconds): after the timeout the load balancing filters are reconfigured.
For a software failure or an administrator restart, RTO = time to (cleanly) stop services + time to restart them.