Hitachi Vantara Knowledge

Failover and failback considerations

These basic rules apply to replication, failover, and failback on replication links, regardless of the replication topology:

  • Multiple failed-over active/active links in a replication topology can be failed back in any order.
  • With failback of multiple failed-over active/passive links in a replication topology, order matters.
  • If a topology includes both failed-over active/active links and failed-over active/passive links, order matters for failing back the active/passive links, but the active/active links can be failed back in any order and at any time.
  • With active/passive links, failover occurs from the primary system to the replica for the same link. Failover cannot occur from the primary system for one link to the replica for a different link.
  • With active/passive links, failback occurs from the replica to the primary system for the same link. Failback cannot occur from the replica for one link to the primary system for another link.
  • With an active/passive link, when a link fails over to the replica, only the HCP tenants and namespaces and default-namespace directories that were read-write on the primary system become read-write on the replica.
  • In a complex replication topology that includes only active/passive links in many-to-one and/or chained relationships, the HCP tenants and namespaces and default-namespace directories being replicated in the topology are read-write on at most one HCP system at a time. This is true regardless of the type of activity on the links.
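The link rules above can be sketched as a few small checks. This is a minimal illustration, not an HCP API: the `Link` class and the function names are hypothetical and exist only to restate the rules for active/passive links in executable form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical description of one replication link (not an HCP object)."""
    name: str
    primary: str           # system that normally sends data on the link
    replica: str           # system that normally receives data on the link
    active_passive: bool   # False means the link is active/active
    failed_over: bool = False


def can_fail_over(link: Link, source: str, target: str) -> bool:
    """Active/passive rule: failover goes from a link's primary to the same link's replica."""
    return source == link.primary and target == link.replica


def can_fail_back(link: Link, source: str, target: str) -> bool:
    """Active/passive rule: failback goes from a link's replica to the same link's primary."""
    return source == link.replica and target == link.primary


def failback_order_matters(links: list[Link]) -> bool:
    """Order matters only when two or more failed-over links are active/passive."""
    failed_passive = [link for link in links if link.active_passive and link.failed_over]
    return len(failed_passive) >= 2


ab = Link("AB", primary="A", replica="B", active_passive=True, failed_over=True)
bc = Link("BC", primary="B", replica="C", active_passive=True, failed_over=True)
xy = Link("XY", primary="X", replica="Y", active_passive=False, failed_over=True)
```

With these sample links, `can_fail_over(ab, "A", "C")` is false because C is the replica for a different link, and `failback_order_matters([ab, xy])` is false because only one of the two failed-over links is active/passive.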

Additional considerations apply to failover and failback in replication topologies that include multiple links.

Failover and failback in an active/passive many-to-one topology

In an active/passive many-to-one replication topology, multiple HCP systems replicate to a single other HCP system.

To recover from a single primary system failure in an active/passive many-to-one replication topology, you follow the normal pattern:

  1. Fail over the link between the failed primary system and the replica.
  2. When the primary system becomes available again, restore the link from the replica.
  3. Begin and complete data recovery from the replica to the primary system.

If more than one primary system fails in an active/passive many-to-one topology, you need to fail the link from each failed system over to the replica. Multiple inbound links on the replica can be in the failed-over state at the same time.

When the failed systems become available again, you can restore the links at any time. However, you can perform data recovery on only one link at a time. (You don’t need to wait for all the failed systems to become available before recovering data to the first one.)

For example, suppose systems A, B, and C all replicate to system D. If both A and B fail, you can return to normal replication with these steps after the failed systems become available again:

  1. On system D, restore the link from A to D.
  2. On system D, begin and complete data recovery from D to A.
  3. On system D, restore the link from B to D.
  4. After the recovery of data from D to A is complete and replication from A to D has restarted, on system D, begin and complete data recovery from D to B.
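The constraint in this example (links can be restored at any time, but data recovery runs on only one link at a time) can be modeled as follows. This is a sketch under stated assumptions: `ManyToOneReplica` and its methods are hypothetical names for illustration and do not correspond to HCP commands.

```python
class ManyToOneReplica:
    """Hypothetical model of the shared replica in a many-to-one topology."""

    def __init__(self, name):
        self.name = name
        self.failed_over = set()   # primaries whose inbound links are failed over
        self.restored = set()      # failed-over links that have been restored
        self.recovering = None     # primary currently receiving recovered data

    def fail_over(self, primary):
        # Several inbound links can be in the failed-over state at once.
        self.failed_over.add(primary)

    def restore(self, primary):
        # Restoring is allowed at any time once the link is failed over.
        if primary not in self.failed_over:
            raise ValueError(f"link from {primary} is not failed over")
        self.restored.add(primary)

    def begin_recovery(self, primary):
        if primary not in self.restored:
            raise ValueError(f"link from {primary} has not been restored")
        if self.recovering is not None:
            # Only one data recovery can be in progress at a time.
            raise RuntimeError(f"data recovery to {self.recovering} is still in progress")
        self.recovering = primary

    def complete_recovery(self):
        if self.recovering is None:
            raise RuntimeError("no data recovery in progress")
        primary = self.recovering
        self.failed_over.discard(primary)
        self.restored.discard(primary)
        self.recovering = None
        return primary


# The A/B/D example from the text: both A and B have failed and come back.
d = ManyToOneReplica("D")
d.fail_over("A")
d.fail_over("B")
d.restore("A")
d.begin_recovery("A")
d.restore("B")          # allowed while recovery to A is still running
d.complete_recovery()   # recovery to A finishes
d.begin_recovery("B")   # only now can recovery to B start
d.complete_recovery()
```

Calling `begin_recovery("B")` before recovery to A completes raises `RuntimeError` in this model, which mirrors step 4 of the example.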

Failover and failback in an active/passive chained topology

In an active/passive chained replication topology, one HCP system replicates to a second HCP system, which replicates to a third HCP system.

The way you manage failover and failback in an active/passive chained replication topology depends on which system or systems have failed. The links from the first system in the chain to the second system and from the second system to the third system function independently of each other, but order matters when you fail them over or restore and perform data recovery on them.

This section of the Help outlines the steps you need to take to return an active/passive replication chain to normal operation after the failure of any one or two of the systems in the chain. The scenarios assume a replication topology in which, when all three systems are healthy:

  • System A replicates to system B on link AB.
  • Link AB includes HCP tenant T1 and default-namespace directory D1, both of which were originally created on system A.
  • System B replicates to system C on link BC.
  • Link BC includes link AB and HCP tenant T2, which was originally created on system B.
  • T1 and D1 are read-write on system A and read-only on systems B and C.
  • T2 is read-write on system B and read-only on system C.

The figure below shows this topology.

[Figure: active/passive chained topology in which system A replicates to system B on link AB, and system B replicates to system C on link BC]

Scenario: System A fails

To return to normal operation after system A fails:

Procedure

  1. On system B, fail link AB over to B.

    T1 and D1 become read-write on B.
  2. When system A becomes available again, on system B, restore link AB.

  3. On system A, accept the restored link.

  4. On system B, begin and complete data recovery on link AB.

Results

When data recovery is complete, T1 and D1 become read-write on A and read-only on B, and replication resumes on link AB.
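The four steps in this scenario follow a fixed sequence, which can be pictured as a small state machine for a single link. The state and action names below are assumptions made for illustration. They are not HCP status values, and the real system may permit transitions that this sketch rejects.

```python
from enum import Enum


class LinkState(Enum):
    """Hypothetical link states, named after the steps in the procedure."""
    NORMAL = "replicating normally"
    FAILED_OVER = "failed over"
    RESTORED = "restored, awaiting acceptance"
    ACCEPTED = "accepted"
    RECOVERING = "recovering data"


# (current state, action) -> next state, in the order the procedure uses them.
TRANSITIONS = {
    (LinkState.NORMAL, "fail_over"): LinkState.FAILED_OVER,
    (LinkState.FAILED_OVER, "restore"): LinkState.RESTORED,
    (LinkState.RESTORED, "accept"): LinkState.ACCEPTED,
    (LinkState.ACCEPTED, "begin_recovery"): LinkState.RECOVERING,
    (LinkState.RECOVERING, "complete_recovery"): LinkState.NORMAL,
}


def next_state(state, action):
    """Return the state after an action, or raise if the action is out of order."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} while the link is {state.value}") from None


def run(actions, state=LinkState.NORMAL):
    """Apply a sequence of actions and return the final state."""
    for action in actions:
        state = next_state(state, action)
    return state


final = run(["fail_over", "restore", "accept", "begin_recovery", "complete_recovery"])
```

After the full sequence, `final` is `LinkState.NORMAL`, matching the result that replication resumes on link AB. An out-of-order action, such as beginning data recovery on a link that has not been failed over, raises `ValueError` in this model.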

Scenario: System B fails

To return to normal operation after system B fails:

Procedure

  1. On system C, fail link BC over to C.

    T1 and D1 remain read-write on A and read-only on C. T2 becomes read-write on C.
  2. When system B becomes available again, on system A, restore link AB.

  3. On system B, accept the restored link.

    T1 and D1 are read-write on A and read-only on B and C, and replication restarts on link AB.
  4. On system C, restore link BC.

  5. On system B, accept the restored link.

  6. On system C, begin and complete data recovery on link BC.

Results

When data recovery is complete, T1 and D1 are read-only on B and C. T2 becomes read-write on B and read-only on C, and replication resumes on link BC.
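The access changes in this scenario follow from the rule that only items that were read-write on the primary system become read-write on the replica. The sketch below tracks that for link BC. The dictionary layout and function names are hypothetical, and marking an item read-only on the failed primary is a modeling assumption rather than documented HCP behavior for this scenario.

```python
def _copy(access):
    """Copy the nested access map so the input is never modified."""
    return {item: dict(per_system) for item, per_system in access.items()}


def fail_over(access, link_items, primary, replica):
    """Fail a link over: items read-write on the primary become read-write on the replica."""
    result = _copy(access)
    for item in link_items:
        if access[item].get(primary) == "rw":
            result[item][primary] = "ro"   # assumption: primary no longer accepts writes
            result[item][replica] = "rw"
    return result


def complete_recovery(access, link_items, primary, replica):
    """Finish data recovery: items read-write on the replica return to the primary."""
    result = _copy(access)
    for item in link_items:
        if access[item].get(replica) == "rw":
            result[item][replica] = "ro"
            result[item][primary] = "rw"
    return result


# Healthy chain from the text: "rw" is read-write, "ro" is read-only.
healthy = {
    "T1": {"A": "rw", "B": "ro", "C": "ro"},
    "D1": {"A": "rw", "B": "ro", "C": "ro"},
    "T2": {"B": "rw", "C": "ro"},
}
bc_items = ["T1", "D1", "T2"]   # link BC includes link AB (T1, D1) and T2

after_failover = fail_over(healthy, bc_items, primary="B", replica="C")
after_recovery = complete_recovery(after_failover, bc_items, primary="B", replica="C")
```

In `after_failover`, only T2 is read-write on C, because T1 and D1 were read-only on B. `after_recovery` equals `healthy` again.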

Scenario: System C fails

To return to normal operation after system C fails:

Procedure

  1. When system C becomes available again, on system B, restore link BC.

  2. On system C, accept the restored link.

Results

T1 and D1 remain read-only on B and C. T2 remains read-write on B and read-only on C, and replication restarts on link BC.

Scenario: Systems A and B fail

To return to normal operation after systems A and B fail:

Procedure

  1. On system C, fail link BC over to C.

    T1 and D1 remain read-only on C. T2 becomes read-write on C.
  2. When system B becomes available again, on system C, restore link BC.

  3. On system B, accept the restored link.

  4. On system C, begin and complete data recovery on link BC.

    When data recovery is complete, T1 and D1 are read-only on B and C. T2 becomes read-write on B and read-only on C, and replication resumes on link BC.

    System B automatically recreates link AB from the inbound link AB that is included in link BC. The recreated link does not have the IP addresses or hostname of the primary system.

  5. When system A becomes available again, take either of these actions:

    • If link AB still exists on A:
      1. On system A, restore link AB.
      2. On system B, accept the restored link.
      3. On system B, begin and complete data recovery on link AB.

      When data recovery is complete, T1 and D1 become read-write on A and remain read-only on B, and replication resumes on link AB.

    • If link AB no longer exists on A, follow the steps for option one or option two below.

      Option one:

      1. On system B, update the configuration of the automatically recreated link AB to include the IP addresses or hostname for system A.
      2. On system B, restore link AB.
      3. On system A, accept the restored link.
      4. On system B, begin and complete data recovery on link AB.

        When data recovery is complete, T1 and D1 are read-write on A and read-only on B and C, and replication restarts on link AB.

      Option two:

      1. On system B, suspend and then delete link AB.

        T1 and D1 become directly included on link BC, which still includes T2, and become read-write on B.

      2. Optionally, create a new replication chain from B to C to A:
        1. Reinstall HCP on A.
        2. On system C, create outbound link CA, including link BC as an inbound link.
        3. On system A, accept the new link.

        T1, D1, and T2 are read-write on B and read-only on C and A, and replication starts on link CA.

Scenario: Systems A and C fail

To return to normal operation after systems A and C fail:

Procedure

  1. On system B, fail link AB over to B.

    T1 and D1 become read-write on B. T2 remains read-write on B.
  2. When system C becomes available again, on system B, restore link BC.

  3. On system C, accept the restored link.

    T1, D1, and T2 are read-write on B and read-only on C, and replication restarts on link BC.
  4. When system A becomes available again, on system B, restore link AB.

  5. On system A, accept the restored link.

  6. On system B, begin and complete data recovery on link AB.

Results

When data recovery is complete, T1 and D1 become read-write on A and read-only on B, and replication resumes on link AB.

Scenario: Systems B and C fail

To return to normal operation after systems B and C fail:

Procedure

  1. When system B becomes available again, on system A, restore link AB.

  2. On system B, accept the restored link.

    T1 and D1 are read-write on A and read-only on B, and replication restarts on link AB.
  3. When system C becomes available again, if T1 and D1 still exist on system C, delete them.

    This requires that all objects in the namespaces owned by T1 and in D1 be deleted first.
    Note: If T2 still exists on systems B and C, delete it from C. If T2 still exists on system C and not on system B, you can recover it to B by creating a link from C to B. However, you cannot then replicate T2 from B to C unless you first delete it from C.
  4. On system B, create outbound link BC, including link AB as an inbound link. If T2 still exists on B, also include T2 in the link.

  5. On system C, accept the new link.

Results

T1 and D1 are read-only on B and C. If T2 is on link BC, T2 is read-write on B and read-only on C. Replication restarts on link BC.

Failover and failback in an active/passive one-to-many topology

In an active/passive one-to-many replication topology, one HCP system replicates to two or more other systems.

If one of the replicas fails, you follow the normal pattern for recovering from a replica failure. If more than one replica fails, you follow the normal recovery pattern for each replication link individually. The order in which you perform the recovery procedures doesn’t matter.

Throughout these failure scenarios, the HCP tenants and namespaces and default-namespace directories on each link remain read-write on the primary system and read-only on the replicas. Therefore, even if the links include the same items, no conflicts can occur.

However, if two or more links include the same HCP tenants and namespaces and default-namespace directories and the primary system fails, these items can be read-write on multiple systems at the same time. This can lead to conflicts during data recovery.

For example, assume a replication topology in which, when all three systems are healthy:

  • System A replicates to system B on link AB. Link AB includes HCP tenant T1.
  • System A replicates to system C on link AC. Link AC also includes HCP tenant T1.

To return to normal operation after system A fails:

Procedure

  1. On system B, fail link AB over to B.

    T1 becomes read-write on B and read-only on A and remains read-only on C.
  2. On system C, fail link AC over to C.

    T1 becomes read-write on C and remains read-only on A and read-write on B. T1 is now read-write on two systems.
    Tip: To prevent recovery conflicts, ensure that clients write to only system B or only system C while both systems are read-write.
  3. When system A becomes available again, on system B, restore link AB.

    You could restore link AC first. The order in which you restore the links doesn’t matter.
  4. On system A, accept the restored link.

  5. On system B, begin and complete data recovery on link AB.

    When data recovery is complete, T1 remains read-only on A because link AC is still failed over to C. It becomes read-only on B and remains read-write on C. Replication resumes on link AB.
  6. When data recovery on link AB is complete, on system C, restore link AC.

  7. On system A, accept the restored link.

  8. On system C, begin and complete data recovery on link AC.

Results

T1 becomes read-write on A and read-only on C and remains read-only on B. Replication resumes on link AC.
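The risk in this example is that T1 is read-write on both B and C between steps 2 and 5. A simple check for that condition is sketched below. The access map and function names are hypothetical and serve only to illustrate the states listed in the procedure.

```python
def writable_systems(access):
    """Map each item to the sorted list of systems where it is read-write."""
    return {
        item: sorted(system for system, mode in per_system.items() if mode == "rw")
        for item, per_system in access.items()
    }


def conflicts(access):
    """Return only the items that are read-write on more than one system."""
    return {
        item: systems
        for item, systems in writable_systems(access).items()
        if len(systems) > 1
    }


# State after step 2: links AB and AC are both failed over.
after_step_2 = {"T1": {"A": "ro", "B": "rw", "C": "rw"}}

# State after step 5: data recovery on link AB is complete, AC is still failed over.
after_step_5 = {"T1": {"A": "ro", "B": "ro", "C": "rw"}}

# State after step 8: data recovery on link AC is complete.
after_step_8 = {"T1": {"A": "rw", "B": "ro", "C": "ro"}}
```

Here `conflicts(after_step_2)` reports T1 as writable on B and C, while `conflicts(after_step_5)` and `conflicts(after_step_8)` are empty. A check of this kind only flags the window in which writes could conflict. It does not prevent clients from writing to both systems.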

Failover and failback in an active/passive many-to-one topology with disaster recovery support

When HCP systems fail in an active/passive many-to-one topology with disaster recovery support, you need to combine the failback patterns for the many-to-one and chained topologies.

For example, assume a replication topology in which, when all five systems are healthy:

  • System A replicates to system D on link AD. Link AD includes HCP tenant T1, which was originally created on system A.
  • System B replicates to system D on link BD. Link BD includes HCP tenant T2, which was originally created on system B.
  • System C replicates to system D on link CD. Link CD includes HCP tenant T3, which was originally created on system C.
  • System D replicates to system E on link DE. Link DE includes links AD, BD, and CD and tenant T4, which was originally created on system D.
  • T1, T2, and T3 are read-write on systems A, B, and C, respectively, and read-only on systems D and E.
  • T4 is read-write on system D and read-only on system E.

The figure below shows this topology.

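[Figure: active/passive many-to-one topology with disaster recovery support, in which systems A, B, and C replicate to system D, and system D replicates to system E]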

To return to normal operation after systems A, B, and D fail:

Procedure

  1. On system E, fail link DE over to E.

    T1, T2, and T3 are read-only on E. T4 becomes read-write on E. T3 remains read-write on C.
  2. When system D becomes available again, on system E, restore link DE.

  3. On system D, accept the restored link.

  4. On system E, begin and complete data recovery on link DE.

    When data recovery is complete, T1, T2, and T3 are read-only on D and E. T4 becomes read-write on D and read-only on E, and replication resumes on link DE.
  5. On system C, restore link CD.

  6. On system D, accept the restored link.

    T3 is read-write on C and read-only on D, and replication restarts on link CD.
  7. When system A becomes available again, on system D, update the configuration of the automatically recreated link AD.

  8. On system D, restore link AD.

  9. On system A, accept the restored link.

  10. On system D, begin and complete data recovery on link AD.

    When data recovery is complete, T1 becomes read-write on A and remains read-only on D. Replication resumes on link AD.
  11. When system B becomes available again and data recovery on link AD is complete, on system D, update the configuration of the automatically recreated link BD.

  12. On system D, restore link BD.

  13. On system B, accept the restored link.

  14. On system D, begin and complete data recovery on link BD.

Results

When data recovery is complete, T2 becomes read-write on B and remains read-only on D. Replication resumes on link BD.
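The ordering in this procedure can be written down as a set of prerequisites and checked mechanically. The sketch below uses Python's standard `graphlib` module. The step names are shorthand invented here, and the prerequisites are read off the step order above, so treating each one as a hard requirement is an assumption.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps that must finish before it can start.
PREREQUISITES = {
    "restore DE": {"fail over DE"},
    "recover DE": {"restore DE"},
    "restore CD": {"recover DE"},
    "restore AD": {"recover DE"},
    "recover AD": {"restore AD"},
    "restore BD": {"recover AD"},   # link BD waits for data recovery on link AD
    "recover BD": {"restore BD"},
}


def respects(order, prerequisites):
    """Return True if every step appears after all of its prerequisites."""
    position = {step: index for index, step in enumerate(order)}
    return all(
        position[before] < position[after]
        for after, befores in prerequisites.items()
        for before in befores
    )


# The order used in the procedure above (accept and update steps omitted).
documented_order = [
    "fail over DE", "restore DE", "recover DE", "restore CD",
    "restore AD", "recover AD", "restore BD", "recover BD",
]

# Any order produced by the sorter also satisfies the prerequisites.
generated_order = list(TopologicalSorter(PREREQUISITES).static_order())
```

Both `documented_order` and `generated_order` satisfy `respects`. An order that starts work on link BD before data recovery on link AD completes does not.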

 
