High availability can be seen as an extension of redundancy. A system or component can be redundant without failover to the spare being instantaneous or automatic: manual intervention might be required, or there may be a delay before the system comes back online.
High availability aims to reduce or remove downtime by maintaining redundant systems that can be switched to instantaneously, or at least with the minimum of impact.
Take databases as an example. In the dim, dark past, a database might be reliant on a technology like log-shipping to offer redundancy in case of an outage.
This relies on nightly backups plus up-to-date transaction log copies for the intervening period. It will ultimately result in some data loss (any logs that had not completed transmission before the outage) and some service downtime while your DBAs work to bring the system back online.
Some people might refer to this as a warm standby. At best, it’s probably tepid.
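To put a number on that data loss, here is a minimal sketch in Python. The function name, the 15-minute shipping interval and the timestamps are all made up for illustration. It assumes the standby can only be restored up to the end of the last transaction log that arrived in full.

```python
from datetime import datetime, timedelta

def data_loss_window(outage_at, shipped_log_ends):
    """Return how much recent work is lost when failing over to the standby.

    shipped_log_ends holds the end time of every transaction log that
    finished copying to the standby (hypothetical input for this sketch).
    """
    usable = [t for t in shipped_log_ends if t <= outage_at]
    if not usable:
        raise ValueError("no transaction log reached the standby before the outage")
    # Anything written after the last fully shipped log is gone.
    return outage_at - max(usable)

# Assumed schedule: a log is shipped every 15 minutes, starting after midnight.
midnight = datetime(2024, 1, 1, 0, 0)
shipped = [midnight + timedelta(minutes=15 * i) for i in range(1, 41)]  # last one ends 10:00

outage = datetime(2024, 1, 1, 10, 7)
print(data_loss_window(outage, shipped))  # 0:07:00
```

In this made-up case seven minutes of transactions never reached the standby, and that is before counting the time your DBAs need to restore the backup and replay the logs.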
With database clustering, on the other hand, the storage between the primary and secondary nodes is shared – perhaps with a SAN volume or replicated disks – so there is no requirement for backups and logs to be shipped.
Both the primary and secondary nodes are online and hold all shared configuration, including the same networking and DNS configuration, which can save precious time during an outage.
Only one node can have access to the storage at a time (known as ownership or majority). The secondary waits, pre-configured and ready to go, until it loses heartbeat connectivity with the primary – then the secondary takes ownership of the storage and becomes the primary node.
The process takes a few seconds, and is almost unnoticeable from an end-user perspective, so it’s easy to see why full high availability is preferable to simple redundancy where possible. Remember: all high availability is a form of redundancy, but not all redundancy is high availability.
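The heartbeat-and-ownership handover described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular clustering product works: the `Cluster` class, the node names and the three-second timeout are all invented for illustration, and time is passed in explicitly so the behaviour is easy to follow.

```python
class Cluster:
    """Toy two-node failover cluster: one node owns the shared storage."""

    def __init__(self, primary, secondary, timeout=3.0):
        self.primary = primary          # node currently serving requests
        self.secondary = secondary      # pre-configured node waiting to take over
        self.timeout = timeout          # seconds of silence tolerated (assumed value)
        self.storage_owner = primary    # only one node may own the storage
        self.last_heartbeat = 0.0

    def record_heartbeat(self, now):
        """Called whenever the secondary hears from the primary."""
        self.last_heartbeat = now

    def check(self, now):
        """Fail over if the primary has been silent for too long."""
        if now - self.last_heartbeat <= self.timeout:
            return False
        # Heartbeat lost: the secondary takes ownership and becomes primary.
        self.primary, self.secondary = self.secondary, self.primary
        self.storage_owner = self.primary
        self.last_heartbeat = now
        return True


cluster = Cluster("db-a", "db-b", timeout=3.0)
cluster.record_heartbeat(now=0.0)
print(cluster.check(now=2.0), cluster.storage_owner)  # False db-a
print(cluster.check(now=5.0), cluster.storage_owner)  # True db-b
```

A real cluster also has to make sure the old primary has truly let go of the storage before the new one writes to it, which this sketch leaves out entirely.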
Disaster recovery does exactly what it says on the tin. If something irrevocably dreadful happens that causes a major outage to your systems or service, disaster recovery serves as a fallback option that allows you to continue operating.
Usually it comes in the form of a secondary set of systems, geographically separate from your primary site. Your disaster recovery site could be a ready-to-go system that contains an entire copy of your production infrastructure, or it might be a seed environment that contains the latest customer data and applications but may require administrative intervention to build out if it is ever needed.
Usually the option you select is a question of budget: fully built disaster recovery systems (that may even be operated in an active/active capacity alongside the production site) are often prohibitively expensive.
The requirement for hardware and systems that are for the most part sitting idle – plus a massive pipe to ensure the secondary site is up-to-the-minute with data from the primary – usually leads to organisations maintaining their disaster recovery site as a warm standby.
Recovery time objectives (how long it may take to bring services online at the secondary site) and recovery point objectives (how much data may not yet have replicated to the secondary site when the incident occurs) are both compromises that have to be considered when planning a disaster recovery solution.
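As a rough illustration of weighing those two objectives against a proposed design, here is a small Python sketch. Every name and figure in it is an assumption chosen for the example: the `DrPlan` class, the 30-minute replication lag, the recovery step durations and the targets themselves.

```python
from dataclasses import dataclass, field

@dataclass
class DrPlan:
    """Hypothetical description of a disaster recovery design."""
    replication_lag_minutes: float                  # how far behind the secondary site runs
    recovery_steps_minutes: list = field(default_factory=list)  # duration of each recovery step

    def worst_case_data_loss(self):
        # Data written since the last replication has not reached the secondary site.
        return self.replication_lag_minutes

    def time_to_recover(self):
        # Steps are assumed to run one after another.
        return sum(self.recovery_steps_minutes)


def meets_objectives(plan, rpo_minutes, rto_minutes):
    """Return (meets_rpo, meets_rto) for the given plan and targets."""
    return (plan.worst_case_data_loss() <= rpo_minutes,
            plan.time_to_recover() <= rto_minutes)


# Assumed warm standby: 30 minutes behind, three manual steps to bring online.
warm_standby = DrPlan(replication_lag_minutes=30,
                      recovery_steps_minutes=[20, 45, 15])

print(warm_standby.time_to_recover())                                  # 80
print(meets_objectives(warm_standby, rpo_minutes=15, rto_minutes=120)) # (False, True)
```

In this invented case the warm standby comes back within the two-hour recovery time target but could lose up to 30 minutes of data against a 15-minute target, so either the replication link or the expectations would have to change.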