Fault-Tolerant Cluster System

From GM-RKB
Jump to navigation Jump to search

A Fault-Tolerant Cluster System is a computing cluster that can guarantee High Availability.



References

2015

  • (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/high-availability_cluster Retrieved:2015-1-13.
    • High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that support server applications that can be reliably utilized with a minimum of down-time. They operate by harnessing redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate filesystems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.

      HA clusters are often used for critical databases, file sharing on a network, business applications, and customer services such as electronic commerce websites.

      HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure, including multiple network connections and data storage which is redundantly connected via storage area networks.

      HA clusters usually use a heartbeat private network connection which is used to monitor the health and status of each node in the cluster. One subtle but serious condition all clustering software must be able to handle is split-brain, which occurs when all of the private links go down simultaneously, but the cluster nodes are still running. If that happens, each node in the cluster may mistakenly decide that every other node has gone down and attempt to start services that other nodes are still running. Having duplicate instances of services may cause data corruption on the shared storage.