High Availability explained in plain English

Following on from my previous article about backups, I thought I’d write a plain English explanation of High Availability (HA).

Picture this scenario: you are about to launch a new product, you’ve sent the marketing email to your email list, and you are expecting thousands of people to flock to your website to buy. Suddenly there’s a problem with the server that your website runs on and the site is now offline!

If your system is not set up for High Availability then this problem may keep your site offline for hours or even days. That’s hours or days of missed sales.

With a Highly Available website, the server problem won’t take your site offline, or if it does, it’s only for a few minutes, depending on the type of HA used!

So what is it?

Quite simply, High Availability is a term used to describe a system that is designed to cope with various failures with minimal, or even zero, downtime.

Measuring Uptime and Downtime

Uptime is a measure of the amount of time a computer system is on and working compared with the time it is meant to be available. Downtime is simply the inverse of this. Typically, for things like websites, we would expect them to be available 24 hours a day, 7 days a week. Uptime is expressed as a percentage. So a 24/7 system that achieves 99% availability was unavailable for 3.65 days in a year, or 7.3 hours per month.

When you start looking into HA you will see something called the number of nines, starting with 99% and going up to 99.9%, then 99.99% and so on. The higher into the nines we go, the more difficult it is to achieve and the more expensive it becomes!

Here’s a handy table that shows what different availability numbers equal in days, hours and minutes of downtime.

Uptime %   Days per Year   Hours per Month   Minutes per Week
99.00%     3.65            7.30              100.80
99.50%     1.82            3.65              50.40
99.90%     0.36            0.73              10.08
99.99%     0.04            0.07              1.01

So you can see that going from a host that provides 99% availability to one that provides 99.9% can have a big impact on your business systems’ availability: downtime drops from 3.65 days a year to less than 9 hours.
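
If you want to work out the figures for an availability number that isn’t in the table, the arithmetic is easy to script. Here is a minimal Python sketch. The function name `downtime_budget` is my own, and it assumes a 365-day year, a month that is one twelfth of a year (730 hours) and a 7-day week (10,080 minutes).

```python
HOURS_PER_YEAR = 365 * 24               # 8,760 hours, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730 hours in an "average" month
MINUTES_PER_WEEK = 7 * 24 * 60          # 10,080 minutes


def downtime_budget(availability_percent):
    """Return the downtime allowed by an availability percentage."""
    down_fraction = 1 - availability_percent / 100
    return {
        "days_per_year": down_fraction * 365,
        "hours_per_month": down_fraction * HOURS_PER_MONTH,
        "minutes_per_week": down_fraction * MINUTES_PER_WEEK,
    }


if __name__ == "__main__":
    # Print one row per availability level.
    for percent in (99.0, 99.5, 99.9, 99.99):
        budget = downtime_budget(percent)
        print(
            f"{percent:6.2f}%  "
            f"{budget['days_per_year']:5.2f} days/year  "
            f"{budget['hours_per_month']:5.2f} hours/month  "
            f"{budget['minutes_per_week']:7.2f} minutes/week"
        )
```

Change the list of percentages to see what an extra nine buys you.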

How is it done?

Not easily is the answer. The entire system needs to be designed for HA to avoid what is called a single point of failure: one component whose failure takes the whole service down. For example, in a simple website there are a number of layers that provide the service. First there is the web server itself, then typically a database if you’re using a CMS (Content Management System). Each of these components requires network connectivity to communicate. The servers also have many points of failure, like hard drives, memory, motherboards, power supplies, and the list goes on…
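
To see why every layer matters, here is a rough Python sketch of how the layers add up. It assumes the site only works when every layer works and that the layers fail independently of each other, which is a simplification. The function name and the 99.9% figures are made up for illustration.

```python
from math import prod


def chain_availability(component_percents):
    """Availability of a service that needs every listed component to work."""
    return 100 * prod(percent / 100 for percent in component_percents)


# Hypothetical figures: web server, database and network at 99.9% each.
combined = chain_availability([99.9, 99.9, 99.9])
print(f"{combined:.2f}%")  # prints 99.70%
```

Three layers that each manage three nines give you less than three nines overall, because any one of them can take the site down.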

A simple way for us to achieve HA is to have another server (or many other servers) configured to provide the same service as the main server. In this configuration the main server is known as the Master server and the others are called standbys.
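
The reason extra servers help can be shown with a little arithmetic. This Python sketch assumes the servers fail independently and that a standby takes over instantly, which is the best case and not something a real setup will fully achieve. The function name is my own.

```python
def redundant_availability(server_percent, server_count):
    """Best-case availability when any one of the servers can run the service."""
    down_fraction = 1 - server_percent / 100
    # The service is only down when every server is down at the same time.
    return 100 * (1 - down_fraction ** server_count)


# Hypothetical figures: a master and one standby, each 99% available.
print(f"{redundant_availability(99.0, 2):.2f}%")  # prints 99.99%
```

In practice the time it takes to switch over eats into that number, which is where the different types of standby come in.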

What’s the temperature?

Cold

This usually means that a standby server is configured to replace the failed master, but the standby is not turned on. When the master fails, the standby needs to be brought online and brought up to date to match the master. This can often be a time-consuming process, so it does not provide as many “nines”, but it is cheaper to implement.

Warm / Hot

This means the standby server is configured, turned on and receiving regular data updates from the master server. When the master fails, the standby is brought into service to replace the master. The switch could be manual (warm) or automated (hot), depending on the way it’s configured.
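
In a hot setup, software makes the decision a person would make in a warm one. Here is a simplified Python sketch of that decision. It assumes the master is checked at regular intervals and that the standby takes over after a set number of failed checks in a row, so that a single blip doesn’t trigger a switch. The function name and the threshold of three are my own choices, and real failover software does a lot more than this.

```python
def failover_index(health_checks, failures_needed=3):
    """Return the index of the check at which the standby should take over.

    health_checks is a sequence of True (master healthy) and False (master
    not responding). Returns None if the threshold is never reached.
    """
    consecutive_failures = 0
    for index, master_is_healthy in enumerate(health_checks):
        if master_is_healthy:
            consecutive_failures = 0  # a good check resets the count
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_needed:
                return index
    return None


# One isolated failure is ignored; three in a row trigger the switch.
checks = [True, True, False, True, False, False, False, True]
print(failover_index(checks))  # prints 6
```

The trade-off is in the threshold: a low number switches faster but risks switching on a false alarm.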

Active – Active

Another configuration worth understanding is Active – Active. This is where two or more servers are configured and all are providing the same service at the same time. This can also be called a server cluster because we usually have more than two servers. In this setup, if one of the servers fails, the system will continue operating without any impact on availability, as long as the remaining servers can carry the load. This is the most complex form of HA and has costs that match!
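
One common way to share work between active servers is to hand out requests in turn and skip any server that is known to be down. Here is a minimal Python sketch of that idea. The class name, method names and server names are all hypothetical, and it leaves out how a failed server is detected in the first place.

```python
from itertools import cycle


class ActiveActivePool:
    """Hands out servers in turn, skipping any that are marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers left")
        # At least one server is healthy, so this loop always returns.
        for server in self._rotation:
            if server in self.healthy:
                return server


pool = ActiveActivePool(["web1", "web2", "web3"])
pool.mark_down("web2")
print([pool.next_server() for _ in range(4)])
# prints ['web1', 'web3', 'web1', 'web3']
```

Visitors never see the failed server, because requests simply stop being sent to it.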

Redundancy but not backup

One thing to note is that HA setups like the ones I have outlined provide a form of redundancy to protect against failure. They do not provide any form of backup protection. Since data replication is usually occurring, if you delete all of your data from the master, it will all be deleted from the standby as well!

Make sure you have your backup solution in place.
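
The difference between a replica and a backup is easy to show with a toy example. In this Python sketch plain dictionaries stand in for the data on each server. The names and the `replicate` function are made up for illustration and are not how any particular database does it.

```python
import copy

master = {"orders": ["order-1", "order-2"]}
standby = copy.deepcopy(master)          # replica kept in sync with the master
nightly_backup = copy.deepcopy(master)   # point-in-time copy, stored separately


def replicate(source, replica):
    """Make the replica match the source exactly, deletions included."""
    replica.clear()
    replica.update(copy.deepcopy(source))


master.clear()              # someone deletes everything by mistake
replicate(master, standby)  # replication faithfully copies the deletion

print(standby)         # prints {}
print(nightly_backup)  # prints {'orders': ['order-1', 'order-2']}
```

The standby did exactly what it was designed to do, and only the backup still has the data.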

Do you need it?

So after reading this you may be thinking – do I need this? Well, it’s a decision you will need to make as you evaluate your overall IT strategy.

My recommendation is to look at the table above and determine, for each of the IT systems you use in your business, the amount of downtime you would be willing to accept. Remember that the higher into the nines you go, the more expensive things get.

You should also check your existing agreements for any IT services to see what availability they commit to. Anything that doesn’t meet your requirements should be factored into your next quarterly IT strategy review!
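
If you have a list of systems to review, a check like this can be scripted. This Python sketch compares the availability a provider promises with the downtime you have decided you can live with. The function name and the example figures are hypothetical, and it assumes an average month of 730 hours.

```python
HOURS_PER_MONTH = 365 * 24 / 12  # 730 hours in an average month


def agreement_meets_requirement(promised_percent, acceptable_hours_per_month):
    """True if the promised availability keeps downtime within your limit."""
    allowed_by_agreement = (1 - promised_percent / 100) * HOURS_PER_MONTH
    return allowed_by_agreement <= acceptable_hours_per_month


# Hypothetical requirement: no more than 1 hour of downtime per month.
print(agreement_meets_requirement(99.0, 1.0))  # prints False (about 7.3 hours allowed)
print(agreement_meets_requirement(99.9, 1.0))  # prints True (about 0.73 hours allowed)
```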

Posted in Explanations.
