
Zero-copy failover behavior


In a SAIN (SAN-attached array of independent nodes) system, zero-copy failover is the process of one node automatically taking over management of storage previously managed by another node that has failed. Support for zero-copy failover is configured at the storage tier and enabled in the HCP system configuration.

Zero-copy failover is supported for storage nodes only.

This chapter describes the storage setup required to support zero-copy failover and explains how zero-copy failover works.

© 2015, 2019 Hitachi Vantara Corporation. All rights reserved.

Storage setup for zero-copy failover


In a SAIN system, nodes have both physical and logical connections to the SAN storage. The physical connections are through paths established by Fibre Channel cables and, in some systems, Fibre Channel switches. The logical connections are through mappings of the logical volumes in the storage array to the nodes.

Physical paths

In an HCP SAIN system, each node has two physical paths to the storage array. This is called multipathing.

The figure below shows two nodes, A and B, at the bottom, each with a multipath connection to the storage array at the top.

[Figure: nodes A and B, each with two physical paths to the storage array]

Logical mappings

To support zero-copy failover, the logical volumes in the storage array must be cross-mapped to the nodes. Cross-mapping means that each logical volume that maps to one node (A) must also map to the same second node (B), called the peer node. The mappings to node A are the primary mappings for those logical volumes, and the mappings to node B are the standby mappings. Similarly, each logical volume with a primary mapping to node B must have a standby mapping to node A.

The figure below shows two sets of logical volumes in a storage array that map to two nodes, A and B. The primary mappings for each node are shown in blue. The standby mappings are shown in red.

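[Figure: cross-mapped logical volumes for nodes A and B, with primary mappings in blue and standby mappings in red]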
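The cross-mapping rule can be expressed as a small consistency check. The following is a minimal sketch, not part of HCP: the VolumeMapping type, the cross_mapping_errors function, and the volume and node names are all hypothetical and serve only to illustrate the rule that every logical volume has a primary and a standby mapping and that each node is always paired with the same peer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeMapping:
    """Hypothetical record of how one logical volume is mapped."""
    volume: str
    primary: str  # node holding the primary mapping
    standby: str  # node holding the standby mapping


def cross_mapping_errors(mappings):
    """Return a list of problems; an empty list means the set is cross-mapped."""
    errors = []
    peer_of = {}  # node -> the one peer seen for that node so far
    for m in mappings:
        if m.primary == m.standby:
            errors.append(f"{m.volume}: primary and standby are both {m.primary}")
            continue
        # Each node must always be paired with the same second node.
        for node, other in ((m.primary, m.standby), (m.standby, m.primary)):
            known = peer_of.setdefault(node, other)
            if known != other:
                errors.append(
                    f"{m.volume}: node {node} is paired with both {known} and {other}"
                )
    return errors


good = [
    VolumeMapping("lun0", primary="A", standby="B"),
    VolumeMapping("lun1", primary="B", standby="A"),
]
bad = good + [VolumeMapping("lun2", primary="A", standby="C")]
```

With these sample values, the first set passes the check, while the second is rejected because node A would have two different peers.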


Failing over and failing back


Under normal circumstances, each node in a cross-mapped pair manages the logical volumes with primary mappings to it. However, if one node becomes unavailable and zero-copy failover is enabled, its peer can take over management of those volumes. The process of one node taking over storage management from another is called failover.

When an unavailable node rejoins the HCP system, it normally takes back management of its own logical volumes. This process is called failback.


Notes: 

During failover, all spindown volumes on the node taking over are spun up. During failback, all spindown volumes on both nodes are spun up.

Zero-copy failover does not apply to external storage volumes. If a node becomes unavailable, the external storage volumes it manages also become unavailable.
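The failover and failback behavior described above can be modeled as a change in which node manages each logical volume. The sketch below is illustrative only and assumes a single cross-mapped pair. The CrossMappedPair class and its method names are hypothetical, and it does not model spindown volumes or external storage volumes.

```python
class CrossMappedPair:
    """Toy model of volume management within one cross-mapped pair."""

    def __init__(self, primary_volumes):
        # primary_volumes maps each of the two nodes to its primary volumes,
        # for example {"A": ["lun0"], "B": ["lun1"]}.
        if len(primary_volumes) != 2:
            raise ValueError("a cross-mapped pair has exactly two nodes")
        self.primary = {node: list(vols) for node, vols in primary_volumes.items()}
        self.available = {node: True for node in primary_volumes}
        self.manager = {
            vol: node for node, vols in self.primary.items() for vol in vols
        }

    def peer(self, node):
        (other,) = [n for n in self.primary if n != node]
        return other

    def node_unavailable(self, node, zcf_enabled=True):
        """Failover: the peer takes over, if it can; otherwise nobody manages."""
        self.available[node] = False
        peer = self.peer(node)
        can_take_over = zcf_enabled and self.available[peer]
        for vol, current in self.manager.items():
            if current == node:
                self.manager[vol] = peer if can_take_over else None

    def node_rejoins(self, node):
        """Failback: the returning node takes back its own volumes."""
        self.available[node] = True
        for vol in self.primary[node]:
            self.manager[vol] = node


pair = CrossMappedPair({"A": ["lun0", "lun1"], "B": ["lun2"]})
pair.node_unavailable("A")
after_failover = dict(pair.manager)
pair.node_rejoins("A")
after_failback = dict(pair.manager)
```

In this model, after_failover shows node B managing all three volumes, and after_failback shows each volume back on the node that holds its primary mapping.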

System Management Console information on failover

During failover, the logical volumes associated with the node that became unavailable first appear as initialized (the VolumeStarted icon) on the Hardware page in the HCP System Management Console. After the peer takes over a logical volume, the icon moves from the row for the unavailable node to the row for the peer, where it once again shows as available (the VolumeAvailable icon). For information on how logical volume status is represented on the Hardware page, see About the Nodes page.

While any logical volumes are in a failed-over state, this alert appears in the alerts section on the Overview page:

Data access failover

For information on the alerts section of the Overview page, see Alerts.

Data outages during failover

While a logical volume is failing over, the data stored on it is temporarily unavailable. If the node that managed the volume became unavailable because you shut it down from the System Management Console, the data outage lasts less than five minutes.

If the node became unavailable for some other reason (for example, all physical paths between the node and the storage array broke, or the node itself failed), the data outage can last significantly longer. Factors that affect how long it lasts include:

The number of logical volumes involved

The size of the logical volumes

The number of objects stored on the logical volumes

The amount of data storage activity occurring when the node became unavailable

Data outages during node restart

When you restart a node from the System Management Console, its storage fails over to its peer during the shutdown part of the restart. The failover process finishes before the node completes its shutdown processing. The data outage caused by the failover lasts less than five minutes.

When the node comes back up, the failback process extends the startup processing by about 15 to 30 seconds. The data outage caused by the failback also lasts less than five minutes.

Data protection mechanism

Zero-copy failover uses a data protection mechanism that prevents the two nodes in a cross-mapped pair from accidentally overwriting each other’s storage. This mechanism needs to be accessible on all storage paths. It is used any time a node in the pair needs to mount or unmount a storage volume. This occurs when:

A node starts

A node stops

A node takes over storage management from its peer (failover)

A node releases control of storage it’s managing (failback)

To ensure access to the data protection mechanism, both paths between the node and the storage array must be available during these transitions. If either path is not available, the node makes itself unavailable to guarantee the safety of the data.

If a path outage occurs, you should find and repair the cause of the outage before trying to start or stop any affected node.
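The path requirement for these transitions amounts to a simple gate. The sketch below is a hypothetical illustration, not HCP code: the Transition enum and the attempt_transition function are invented names, and the function encodes only the rule stated above that both paths must be available, or the node makes itself unavailable.

```python
from enum import Enum


class Transition(Enum):
    """The four events that mount or unmount storage volumes."""
    NODE_START = "node starts"
    NODE_STOP = "node stops"
    FAILOVER = "node takes over storage from its peer"
    FAILBACK = "node releases storage it is managing"


def attempt_transition(transition, paths_available):
    """Decide whether a transition may proceed.

    paths_available holds one boolean per physical path between the node
    and the storage array (two paths per node).
    """
    if len(paths_available) != 2:
        raise ValueError("expected the status of exactly two paths")
    if all(paths_available):
        return f"proceed: {transition.value}"
    # With either path down, the data protection mechanism may be
    # unreachable, so the node takes itself out of service.
    return f"blocked: {transition.value}; node makes itself unavailable"


both_up = attempt_transition(Transition.FAILBACK, [True, True])
one_down = attempt_transition(Transition.NODE_START, [True, False])
```

Here both_up begins with "proceed" and one_down begins with "blocked", which is why a path outage should be repaired before any affected node is started or stopped.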

Failover and physical path outages

When one physical path between a node and the storage array breaks, processing continues normally. If the second path for that node breaks while the first one is still broken, the node becomes unavailable, and failover occurs.

To return the node to service and fail back the logical volumes, you need to fix both paths before you reboot the node. The node will not return to service while either path is broken.

If a node managing failed-over storage has a path outage while its peer is still unavailable, processing continues normally. However, if the peer returns to service and takes back its storage, the node with the path outage then becomes unavailable, and failover occurs in the other direction. That is:

1. Node A fails.

2. Node A storage fails over to node B.

3. Node B has a path outage.

4. Node A returns to service.

5. Node A storage fails back to node A.

6. Node B becomes unavailable.

7. Node B storage fails over to node A.
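The seven steps above can be replayed with a small simulation. This is a sketch under stated assumptions: the PairSimulation class, its method names, and the "A-volumes" and "B-volumes" labels are hypothetical, and the model covers only the rules given in this section (one broken path is tolerated, two broken paths cause failover, and failback requires both of the releasing node's paths).

```python
class PairSimulation:
    """Toy model of one cross-mapped pair (nodes A and B) with path outages."""

    def __init__(self):
        self.available = {"A": True, "B": True}
        self.paths_up = {"A": 2, "B": 2}
        self.manager = {"A-volumes": "A", "B-volumes": "B"}

    @staticmethod
    def _peer(node):
        return "B" if node == "A" else "A"

    def _fail_over(self, node):
        # The node becomes unavailable; its peer takes over whatever it managed.
        self.available[node] = False
        peer = self._peer(node)
        for vols, current in self.manager.items():
            if current == node:
                self.manager[vols] = peer if self.available[peer] else None

    def node_fails(self, node):
        self._fail_over(node)

    def path_breaks(self, node):
        self.paths_up[node] = max(0, self.paths_up[node] - 1)
        # One broken path: processing continues. Both broken: failover.
        if self.paths_up[node] == 0 and self.available[node]:
            self._fail_over(node)

    def node_returns(self, node):
        if self.paths_up[node] < 2:
            return False  # a node will not return while either path is broken
        self.available[node] = True
        self.manager[f"{node}-volumes"] = node  # failback
        peer = self._peer(node)
        # Releasing storage needs both of the peer's paths; if one is down,
        # the peer becomes unavailable and its own storage fails over.
        if self.available[peer] and self.paths_up[peer] < 2:
            self._fail_over(peer)
        return True


sim = PairSimulation()
sim.node_fails("A")    # steps 1 and 2
sim.path_breaks("B")   # step 3
sim.node_returns("A")  # steps 4 to 7
```

At the end of this run, node A manages both sets of volumes and node B is unavailable, which matches step 7.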
