High-Availability Solutions for the SAP System

This section summarizes the available high-availability solutions for the SAP system, looking at general features, product range, architecture, and functionality.

High-availability solutions protect system services by switching them over to standby resources in case a critical resource fails. These solutions address the single points of failure in hardware that cannot be protected by standard technology (such as hot pluggable RAID, UPS, backup power supply, and so on). If you use high-availability solutions with the standard technologies discussed elsewhere in this documentation, you can substantially improve the availability of your SAP system by comprehensively covering its single points of failure.

High-availability solutions offer a certain level of automation in monitoring the health of system components and in detecting and reacting to component failures. High-availability solutions cannot guarantee zero downtime. However, they can limit the impact of host machine failures on your SAP system and restrict its unplanned downtime to tolerable levels.

High-availability solutions allow the definition of highly available cluster systems, which are defined as a number of loosely coupled hosts with shared disks. High-availability solutions are capable of monitoring and controlling different system resources such as host machines, network adapters, and so on. In the event of failure, the service offered by the resource is automatically taken over by a standby resource.

This section focuses on how to use high-availability solutions to protect the SAP system against failures of host machines (such as power supply failure, CPU failure, board failure), which are of key importance to the availability of your SAP system.

High-availability solutions can be:

Part of the operating system
Closely attached to the operating system, but not actually part of it

Integration

For the Windows operating system, Microsoft offers the Windows Server Failover Clustering feature, which is an example of a high-availability solution that is part of the operating system. SAP supports this product as the standard failover solution for Windows.

For more information, see the Windows installation guides at:

http://service.sap.com/instguidesNW

For the second type of high-availability solutions – that is, closely attached to the operating system but not actually part of it – you can contact certified hardware partners.

For more information, see: http://scn.sap.com/docs/DOC-8541

Activities

SAP does not certify high-availability solutions for compliance with SAP products. However, SAP provides detailed technical guidelines.

At the center of high-availability solutions are one or more software components, usually called the “cluster manager”. The cluster manager establishes a heartbeat between the cluster nodes, which is used diagnostically to decide whether a network link or a node has gone down. The cluster manager might also monitor other local resources and take appropriate actions in the event of failure. However, this section focuses on host machine failures and corresponding high-availability solutions.
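The heartbeat idea can be illustrated with a minimal sketch. It is not taken from any particular cluster manager: the class name, the node names, and the five-second timeout are assumptions chosen for the example. The sketch only flags nodes that have been silent for too long; it does not attempt to distinguish a failed network link from a failed node, which a real cluster manager has to do.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer node (illustrative only)."""

    timeout_seconds: float  # hypothetical threshold, not a product default
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Called whenever a heartbeat message arrives from `node`.
        self.last_seen[node] = now

    def suspected_down(self, now: float) -> List[str]:
        # A node is suspected down once its last heartbeat is older than the timeout.
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("node_a", now=0.0)
monitor.record_heartbeat("node_b", now=0.0)
monitor.record_heartbeat("node_a", now=4.0)  # node_b stays silent
print(monitor.suspected_down(now=6.0))  # prints ['node_b']
```

Timestamps are passed in explicitly rather than read from the system clock so that the behavior is easy to follow and to test.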

In the SAP system, all data is stored on a central database. Disk sharing is therefore necessary to make failover consistent and transparent for the end user. The normal implementation for this is a twin-tailed SCSI bus, in which each connected node has its own SCSI interface (and therefore uses an additional SCSI ID). Some products also allow the use of a proprietary disk subsystem. Shared disk access is controlled by an additional component of the high-availability solution. Access is usually exclusive to one of the nodes, and accordingly has to be switched whenever an SAP service fails over.
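The rule that disk access is exclusive and has to move together with the service can be sketched as follows. This is a simplified model under stated assumptions, not the behavior of any specific product: SharedDisk, fail_over_service, and the node names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SharedDisk:
    """A shared disk that at most one node may access at a time (illustrative)."""

    name: str
    owner: Optional[str] = None

    def acquire(self, node: str) -> None:
        # Refuse access while another node still owns the disk.
        if self.owner is not None and self.owner != node:
            raise RuntimeError(f"{self.name} is still owned by {self.owner}")
        self.owner = node

    def release(self, node: str) -> None:
        # Only the current owner can give up the disk.
        if self.owner == node:
            self.owner = None


def fail_over_service(disk: SharedDisk, failed_node: str, standby_node: str) -> str:
    """Switch disk ownership, then return the node on which the service may start."""
    disk.release(failed_node)
    disk.acquire(standby_node)
    return standby_node


data_disk = SharedDisk(name="sap_data", owner="node_a")
print(fail_over_service(data_disk, "node_a", "node_b"))  # prints node_b
print(data_disk.owner)  # prints node_b
```

The ordering is the point of the sketch: ownership is switched first, and only then is the standby node reported as ready to run the service.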

In addition to monitoring network links with a heartbeat, high-availability solutions also control the network adapters. You can configure an additional network adapter as a “standby” for the primary adapter. If the primary adapter fails, the standby adapter, which does not have an IP address of its own, takes over the IP address of the primary. To increase the redundancy of the communication link between the cluster nodes, you can connect the two network cards in each node to separate physical network hardware.
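The IP address takeover described above can be modeled in a few lines. The sketch assumes a hypothetical Adapter record and a fail_over_ip helper, and it uses an address from the reserved documentation range; it only models the bookkeeping and does not reconfigure real network interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Adapter:
    name: str
    ip_address: Optional[str] = None  # the standby starts without an address
    healthy: bool = True


def fail_over_ip(primary: Adapter, standby: Adapter) -> bool:
    """Move the primary's IP address to the standby if the primary has failed.

    Returns True if a takeover happened.
    """
    if primary.healthy or primary.ip_address is None:
        return False  # primary is fine, or there is no address to move
    if not standby.healthy or standby.ip_address is not None:
        return False  # standby is unusable or already carries an address
    standby.ip_address, primary.ip_address = primary.ip_address, None
    return True


primary = Adapter(name="adapter0", ip_address="192.0.2.10")
standby = Adapter(name="adapter1")
primary.healthy = False  # simulate a failure of the primary adapter
print(fail_over_ip(primary, standby))  # prints True
print(standby.ip_address)  # prints 192.0.2.10
```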

The schematic graphic below shows the logical components of a high-availability solution for a single cluster node, together with the connections to and from it. Note that the diagram does not imply a single cohesive process. For example, disk control is usually part of the operating system (such as a logical volume manager), and part of network control might be firmware in a physical device or might be located in the operating system (such as the programming of a MAC address).

[Figure: High-Availability Components]

Apart from the physical configuration of the high-availability environment, the most difficult part of the setup is the definition of the actions that are necessary to properly move an application over to the standby resource.

Some high-availability solutions predefine events (for example, node_down, network_down) as well as rules and actions for the cluster manager to react to these events. In most cases the actions are defined as a sequence of commands in a simple shell script. These actions include some “generic” system commands (activate shared disks, check and mount filesystems) and application-specific commands (start database, move some files).
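The relationship between events and action sequences can be sketched as a simple lookup table. The event names node_down and network_down and the disk, filesystem, and database steps come from the text above; the function names, the mapping of network_down to an adapter switch, and the use of Python instead of a shell script are assumptions made for illustration. Each step only returns a description instead of running a real command.

```python
from typing import Callable, Dict, List


def activate_shared_disks() -> str:
    return "activate shared disks"


def check_and_mount_filesystems() -> str:
    return "check and mount filesystems"


def start_database() -> str:
    return "start database"


def switch_to_standby_adapter() -> str:
    return "switch to standby network adapter"


# Each predefined event maps to an ordered sequence of actions:
# generic system steps first, application-specific steps last.
EVENT_ACTIONS: Dict[str, List[Callable[[], str]]] = {
    "node_down": [activate_shared_disks, check_and_mount_filesystems, start_database],
    "network_down": [switch_to_standby_adapter],  # hypothetical rule
}


def handle_event(event: str) -> List[str]:
    """Run the actions registered for `event` in order and return what was done."""
    if event not in EVENT_ACTIONS:
        raise ValueError(f"no rule defined for event {event!r}")
    return [action() for action in EVENT_ACTIONS[event]]


for step in handle_event("node_down"):
    print(step)
```

Keeping the sequence in a list makes the order explicit, which matters because the database cannot start before the shared disks are active and the filesystems are mounted.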

With Windows Server Failover Clustering, SAP delivers a preconfigured solution. Therefore, you do not need to define failover events, rules, or actions.

All high-availability solutions allow management of the cluster with system commands or with more user-friendly interfaces. There are basic commands for cluster management to perform the following tasks:

Configure the cluster and its nodes
Start and stop the cluster
Add nodes to the cluster
Move applications manually to another node (for node maintenance)
Query the status of the cluster

Manual failover is useful to make cluster nodes available for maintenance (that is, to allow planned downtime).
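The management tasks listed above can be pictured with a small in-memory model. It does not correspond to the command set of any real high-availability product: the Cluster class, its method names, and the application name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Cluster:
    """Toy model of a cluster and the applications placed on its nodes."""

    nodes: Set[str] = field(default_factory=set)
    placement: Dict[str, str] = field(default_factory=dict)  # application -> node
    running: bool = False

    def add_node(self, node: str) -> None:
        self.nodes.add(node)

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def place(self, application: str, node: str) -> None:
        # Initial assignment of an application to a known node.
        if node not in self.nodes:
            raise ValueError(f"unknown node {node!r}")
        self.placement[application] = node

    def move_application(self, application: str, target: str) -> None:
        # Manual failover, for example before maintenance on the current node.
        if not self.running:
            raise RuntimeError("cluster is not running")
        if application not in self.placement:
            raise KeyError(application)
        if target not in self.nodes:
            raise ValueError(f"unknown node {target!r}")
        self.placement[application] = target

    def status(self) -> Dict[str, object]:
        return {
            "running": self.running,
            "nodes": sorted(self.nodes),
            "placement": dict(self.placement),
        }


cluster = Cluster()
cluster.add_node("node_a")
cluster.add_node("node_b")
cluster.place("sap_central_instance", "node_a")
cluster.start()
cluster.move_application("sap_central_instance", "node_b")  # free node_a for maintenance
print(cluster.status())
```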

This documentation and the current solutions focus on node failures.