
Managing failover and failback


When one system involved in an active/active link fails or when the primary system for an active/passive link fails, you can fail over the link to the other system involved in the link. Failing over the link stops replication on the link and, if DNS failover is enabled, allows client requests that target the failed system to be redirected to the healthy system.

If a system failure results in a replication link being broken (for example, due to the system being rebuilt after a catastrophic failure), you need to restore the link before replication can restart or data recovery can occur on that link. This applies regardless of the link type and, for an active/passive link, regardless of whether the failed system is the primary system or the replica.

To restart replication after failing over a link, you need to fail back the link. Failing back the link restarts replication on the link and returns the HCP systems involved to normal operation. For an active/passive link, failing back includes recovering data from the replica to the primary system.

Failover can be automated for both active/active and active/passive links. Failback can be partially automated for active/passive links.

This section of the Help provides instructions and considerations for managing failover and failback manually. For an overview of manual and automatic failover and failback with HCP replication, see Failover and failback.


Roles: To fail over a replication link, restore a link, or recover replicated data, you need the administrator role.


Note: You can also use the HCP management API to manage failover and failback of replication links. For information on doing this, see Replication resources.
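As a rough illustration of the note above, the sketch below only builds the request URL for a link action; it sends nothing. The host name, port 9090, the /mapi/services/replication/links/ path, and the action names are assumptions made for illustration. Confirm the actual resource paths and parameters in Replication resources before relying on them.

```python
from urllib.parse import quote

# Hypothetical action names; check "Replication resources" for the real ones.
LINK_ACTIONS = {"failOver", "failBack", "restore", "beginRecovery", "completeRecovery"}


def link_action_url(admin_host: str, link_name: str, action: str) -> str:
    """Build (but do not send) a management API URL for a replication link action."""
    if action not in LINK_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    # Assumed layout: https://<admin host>:9090/mapi/services/replication/links/<link>?<action>
    encoded_name = quote(link_name, safe="")  # link names may contain spaces
    return f"https://{admin_host}:9090/mapi/services/replication/links/{encoded_name}?{action}"


# Example with a made-up host and link name.
print(link_action_url("admin.hcp-b.example.com", "A to B", "failOver"))
```

Keeping URL construction separate from the HTTP call makes it easy to review the exact request before an administrator sends it.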

© 2015, 2019 Hitachi Vantara Corporation. All rights reserved.

Failover and failback workflows


Three failover and failback scenarios are possible, depending on which of these HCP systems fails:

One of the systems involved in an active/active link

The primary system for an active/passive link

The replica for an active/passive link

This section of the Help describes the basic workflows for these scenarios. For information on failover and recovery workflows in more complex replication topologies, see Failover and failback considerations.


System failure workflow with an active/active link


The table below outlines what happens when one of the systems involved in an active/active link fails, where the system that fails is system A and the system that remains healthy is system B.

Step | What you do | What happens

Event: System A fails

1 | On system B, fail over the link | If DNS failover is enabled, system B broadcasts new DNS configuration

2 | If DNS failover is disabled, direct clients to write only to system B |

Event: System A comes back online

3 | If system A has been rebuilt: on system A, upload the replication SSL server certificate from system B; on system B, upload the replication SSL server certificate from system A |

4 | On system B, update the link configuration as needed |

5 | If the link is broken, on system B, send a request to restore the link | Replication link is recreated

6 | On system B, fail back the link | System A and system B broadcast original DNS configurations; replication restarts in both directions on the link
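The conditional steps in this workflow can be sketched as a small checklist builder. This is only an illustration of the table above: the function name and its parameters are hypothetical, and the code performs no action on any HCP system.

```python
def active_active_steps(dns_failover: bool, rebuilt: bool, link_broken: bool,
                        config_changed: bool = False) -> list:
    """Return the operator steps for an active/active link when system A fails."""
    steps = ["On system B, fail over the link"]
    if not dns_failover:
        steps.append("Direct clients to write only to system B")
    # The remaining steps apply once system A is back online.
    if rebuilt:
        steps.append("On system A, upload the replication SSL server certificate from system B")
        steps.append("On system B, upload the replication SSL server certificate from system A")
    if config_changed:
        steps.append("On system B, update the link configuration")
    if link_broken:
        steps.append("On system B, send a request to restore the link")
    steps.append("On system B, fail back the link")
    return steps


# Example: system A was rebuilt, so the link is broken; DNS failover is disabled.
for number, step in enumerate(active_active_steps(False, True, True), start=1):
    print(number, step)
```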


Primary system failure workflow


The table below outlines what happens when the primary system for an active/passive link fails.

Step | What you do | What happens

Event: Primary system fails

1 | On the replica, fail over the link | Applicable tenants and directories on the replica become read-write; applicable tenants and directories on the primary system either remain read-write or become read-only depending on whether the two systems can communicate with each other; if DNS failover is enabled, the replica broadcasts new DNS configuration

2 | If DNS failover is disabled, direct clients to write only to the replica |

Event: Primary system comes back online

3 | If the primary system has been rebuilt: on the primary system, upload the replication SSL server certificate from the replica; on the replica, upload the replication SSL server certificate from the primary system |

4 | On the replica, update the link configuration as needed |

5 | If the link is broken, on the replica, send a request to restore the link | Replication link is recreated

6 | On the replica, begin data recovery | Applicable tenants and directories on the replica remain read-write; applicable tenants and directories on the primary system remain or become read-only; data recovery from the replica to the primary system begins

7 | Wait for data recovery to come close to being up to date |

8 | On the replica, complete data recovery | Applicable tenants and directories on the replica become read-only; applicable tenants and directories on the primary system remain read-only; data recovery from the replica to the primary system continues to completion

Event: Data recovery finishes

9 | Nothing | Applicable tenants and directories on the replica remain read-only; applicable tenants and directories on the primary system become read-write; the primary system and the replica broadcast original DNS configurations; replication from the primary system to the replica restarts

10 | If DNS failover is disabled, after you see this message in the system log, direct clients to write only to the primary system: Replication data recovery completed |
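The access changes in this table can be read as a small state machine. The sketch below is a model for reasoning only; the phase and event names are illustrative labels, not HCP terms, and the access values are copied from the "What happens" column.

```python
# (replica access, primary access) for the applicable tenants and directories.
PHASES = {
    "failed_over": ("read-write", "read-write or read-only"),
    "recovering": ("read-write", "read-only"),
    "completing": ("read-only", "read-only"),
    "normal": ("read-only", "read-write"),
}

# Allowed moves between phases; the event names are illustrative labels.
TRANSITIONS = {
    ("failed_over", "begin_recovery"): "recovering",
    ("recovering", "complete_recovery"): "completing",
    ("completing", "recovery_finished"): "normal",
}


def advance(phase: str, event: str) -> str:
    """Return the next phase, or raise if the event is out of order."""
    try:
        return TRANSITIONS[(phase, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in phase {phase!r}") from None


phase = "failed_over"
for event in ("begin_recovery", "complete_recovery", "recovery_finished"):
    phase = advance(phase, event)
    replica, primary = PHASES[phase]
    print(f"{event}: replica {replica}, primary {primary}")
```

The model makes one property of the workflow easy to see: between completing recovery and the end of recovery, neither system accepts writes.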

 


Replica failure workflow


The table below outlines what happens when the replica for an active/passive link fails.

Step | What you do | What happens

Event: Replica fails

1 | On the primary system, suspend activity on the link |

Event: Replica comes back online

2 | If the replica has been rebuilt: on the replica, upload the replication SSL server certificate from the primary system; on the primary system, upload the replication SSL server certificate from the replica |

3 | On the primary system, update the link configuration as needed |

4 | If the link is broken, on the primary system, send a request to restore the link | Replication link is recreated; applicable tenants and directories on the primary system remain read-write; applicable tenants and directories on the replica are read-only

5 | On the primary system or the replica, resume activity on the link | Replication from the primary system to the replica restarts from the beginning


Failing over


To fail over a replication link:

1. In the top-level menu of the HCP System Management Console for the system you want to fail over to, select Services > Replication.

2. On the replication Links page, click on the link you want to fail over.

3. On the replication link details page, click on Link.

4. In the replication Link panel, click on the Failover tab.

5. In the link Failover panel, click on Fail Over.

A confirming message appears.

6. In the window with the confirming message, select I understand to confirm that you understand the consequences of your action. Then click on Fail Over Link.

7. For an active/passive link, if DNS failover is disabled, tell the applicable tenant administrators to direct all client access requests to the replica.


Recovering from a failure


During a catastrophic failure, an HCP system can lose all configuration information. In this case, if the system participates in a replication link, you need to restore the link configuration after the system is rebuilt. If the failure did not cause the system to lose the link configuration, you don’t need to restore the link.

Once the link configuration exists on both systems involved in the link:

For an active/active link, you need to perform the failback procedure.

For an active/passive link:

- If the primary system failed, you need to recover namespace content and other applicable information from the replica.

Note: After you fail over an active/passive link, the only way to return to normal replication is by going through the data recovery procedure. You need to perform this procedure even if you don’t need to restore the link and even if no changes have been made to the configuration or content of the replicated items. Even when nothing has changed, the data recovery process can take more than five minutes.

- If the replica failed, replication automatically restarts, beginning again with the objects with the oldest metadata changes either across all namespaces or within each namespace, depending on the link configuration.
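The decision described above can be summarized as a short lookup. This is a sketch for orientation only; the function name, argument values, and returned strings are hypothetical.

```python
def recovery_procedures(link_type: str, failed_role: str, link_config_lost: bool) -> list:
    """List the manual procedures needed after the failed system is available again.

    link_type is "active/active" or "active/passive"; failed_role ("primary" or
    "replica") matters only for active/passive links.
    """
    if link_type not in ("active/active", "active/passive"):
        raise ValueError(f"unknown link type: {link_type}")
    procedures = []
    if link_config_lost:
        procedures.append("restore the link")
    if link_type == "active/active":
        procedures.append("fail back the link")
    elif failed_role == "primary":
        procedures.append("recover data from the replica to the primary system")
    elif failed_role != "replica":
        raise ValueError(f"unknown role: {failed_role}")
    # For a failed replica, nothing more is added: replication restarts automatically.
    return procedures


print(recovery_procedures("active/passive", "primary", link_config_lost=True))
```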


Restoring a link


Before you can restore a replication link, both systems involved in the link must have the required SSL certificates installed, as described in Configuring SSL for replication.

You don't need to restore a link if both systems involved in the link have the correct link configuration. In this case, you can skip this step of the recovery procedure even though the Restore Link button is active.

To restore a replication link after a system failure:

1. In the top-level menu of the HCP System Management Console for the system where the link configuration still exists, select Services > Replication.

2. On the replication Links page, click on the link you want to restore.

3. On the replication link details page, click on Link.

4. If necessary, update the link configuration. You need to do this only if the domain name or IP addresses of the other system have changed.

If the domain name of the other system has not changed, do not replace the displayed IP addresses with the domain name.


Note: After the failure of a system that's part of an erasure coding topology, if the replacement system has a different domain name or different IP addresses from the failed system, you need to reconfigure all the replication links that connect other systems in the topology to the failed system. When reconfiguring the links, ensure that they all connect to the same replacement or rebuilt system. If the links connect to different systems, the erasure coding topology will no longer be functional.

For instructions on updating the link configuration, see Modifying a replication link.

5. In the replication Link panel, click on the Failover tab.

6. In the Failover panel, click on Restore Link.

Note: If you don’t see the Restore Link button, wait a few minutes and then redisplay the Replication page. Before the System Management Console can display this button, HCP needs to recognize that the other system is available, has the necessary SSL certificates installed, and isn’t aware of the existing link.

After you restore a link, the Restore Link button remains active. This state enables you to repeat the restore procedure if the system that failed needs to be rebuilt again before you fail back or begin recovery on the link. Even though the button is still active, you can proceed to the next step in the recovery procedure.


Failing back an active/active link


To fail back an active/active link:

1. In the top-level menu of the HCP System Management Console for the system you failed over to, select Services > Replication.

2. On the replication Links page, click on the link you want to fail back.

3. On the replication link details page, click on Link.

4. In the replication Link panel, click on the Failover tab.

5. In the link Failover panel, click on Fail Back.


Recovering the data after a primary system failure


To recover namespace content and other applicable information from a replica to a primary system:

1. In the top-level menu of the HCP System Management Console for the replica, select Services > Replication.

2. On the replication Links page, click on the link on which you want to recover data.

3. On the replication link details page, click on Link.

4. In the replication Link panel, click on the Failover tab.

5. In the link Failover panel, click on Begin Recovery.

Note: After uploading new trusted replication server certificates, you may need to wait more than ten minutes for the Begin Recovery button to become active.

The applicable HCP tenants and default-namespace directories become read-only on the primary system, and the replication service starts copying the applicable objects and configuration information from the replica to the primary system. As with replication from the primary system to the replica, the service starts with the objects with the oldest metadata changes either across all namespaces or within each namespace, depending on the link configuration.

Note: If the primary system cannot communicate with Active Directory and either of these is true for a tenant, recovery of that tenant is automatically paused:

The tenant to be recovered supports AD authentication.

A namespace owned by the tenant supports AD single sign-on.

When communication between the primary system and AD is restored, you can resume recovery of the tenant.

6. Monitor the recovery process by periodically reviewing the information in the status Overview and status Tenants panels for the link.

7. When data recovery is almost synchronized with current tenant and namespace activities on the replica, return to the Failover panel for the link. Synchronization is nearing completion when the up-to-date-as-of time for the link is close to zero.

Note: As long as clients continue writing to the replica, synchronization won’t reach one hundred percent. Synchronization doesn’t need to be completely up to date for you to start the complete recovery phase.

8. In the link Failover panel, click on Complete Recovery.

The applicable tenants and directories on the replica immediately become read-only. The tenants and directories on both systems then remain read-only until the replication service finishes the data recovery. The amount of time this takes depends on how much data is left to recover.

When recovery is complete, the tenants and directories on the primary system become read-write, those on the replica remain read-only, and the replication service on the primary system starts copying objects to the replica again.


Tips: 

You can schedule completion of the data recovery process for a time when client usage of the repository is low.

If, before final recovery is complete, you need to allow clients to write to the applicable tenants and directories on the replica again, click on Cancel Final Recovery in the link Failover panel. The recovery process continues, but the applicable tenants and directories become read-write on the replica and remain read-only on the primary system until you click on Complete Recovery again.

9. If DNS failover is disabled:

a. Wait for this message to appear in the system log:

Replication data recovery completed

b. Tell the applicable tenant administrators to redirect all client access requests to the primary system.
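Step 7 of the procedure above asks you to judge when recovery is close enough to start the completion phase. A minimal sketch of that judgment follows. It assumes you have noted the link's up-to-date-as-of time as a timestamp and that "close" means within a threshold you choose; the function name and the five-minute default are illustrative assumptions, not HCP recommendations.

```python
from datetime import datetime, timedelta, timezone


def ready_to_complete(up_to_date_as_of: datetime, now: datetime,
                      max_lag: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when recovery lags current activity by no more than max_lag."""
    lag = now - up_to_date_as_of
    # A negative lag suggests clock skew between readings; do not treat it as ready.
    return timedelta(0) <= lag <= max_lag


now = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)
print(ready_to_complete(now - timedelta(minutes=2), now))  # within the threshold
print(ready_to_complete(now - timedelta(hours=1), now))    # still far behind
```

Because clients keep writing to the replica, the lag never reaches exactly zero, which is why the check uses a threshold rather than equality.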


Failover and failback considerations


These basic rules apply to replication, failover, and failback on replication links, regardless of the replication topology:

Multiple failed-over active/active links in a replication topology can be failed back in any order.

With failback of multiple failed-over active/passive links in a replication topology, order matters.

If a topology includes both failed-over active/active links and failed-over active/passive links, order matters for failing back the active/passive links, but the active/active links can be failed back in any order and at any time.

With active/passive links, failover occurs from the primary system to the replica for the same link. Failover cannot occur from the primary system for one link to the replica for a different link.

With active/passive links, failback occurs from the replica to the primary system for the same link. Failback cannot occur from the replica for one link to the primary system for another link.

With an active/passive link, when a link fails over to the replica, only the HCP tenants and namespaces and default-namespace directories that were read-write on the primary system become read-write on the replica.

In a complex replication topology that includes only active/passive links in many-to-one and/or chained relationships, the HCP tenants and namespaces and default-namespace directories being replicated in the topology are read-write on at most one HCP system at a time. This is true regardless of the type of activity on the links.

Additional considerations apply to failover and failback in replication topologies that include multiple links.
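The same-link rules for active/passive failover and failback can be expressed as simple checks. The sketch below uses a hypothetical Link record purely to illustrate those two rules; it is not an HCP data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Illustrative record of an active/passive link."""
    name: str
    primary: str
    replica: str


def is_valid_failover(link: Link, source: str, target: str) -> bool:
    """Failover goes from a link's primary system to that same link's replica."""
    return (source, target) == (link.primary, link.replica)


def is_valid_failback(link: Link, source: str, target: str) -> bool:
    """Failback goes from a link's replica to that same link's primary system."""
    return (source, target) == (link.replica, link.primary)


ab = Link("AB", primary="A", replica="B")
print(is_valid_failover(ab, "A", "B"))  # same link: allowed
print(is_valid_failover(ab, "A", "C"))  # C is the replica for a different link: not allowed
```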


Failover and failback in an active/passive many-to-one topology


In an active/passive many-to-one replication topology, multiple HCP systems replicate to a single other HCP system. For an explanation of this topology, see Active/passive many-to-one replication.

To recover from a single primary system failure in an active/passive many-to-one replication topology, you follow the normal pattern:

1. Fail over the link between the failed primary system and the replica.

2. When the primary system becomes available again, restore the link from the replica.

3. Begin and complete data recovery from the replica to the primary system.

If more than one primary system fails in an active/passive many-to-one topology, you need to fail the link from each failed system over to the replica. Multiple inbound links on the replica can be in the failed-over state at the same time.

When the failed systems become available again, you can restore the links at any time. However, you can perform data recovery on only one link at a time. (You don’t need to wait for all the failed systems to become available before recovering data to the first one.)

For example, suppose systems A, B, and C all replicate to system D. If both A and B fail, you can return to normal replication with these steps after the failed systems become available again:

1. On system D, restore the link from A to D.

2. On system D, begin and complete data recovery from D to A.

3. On system D, restore the link from B to D.

4. After the recovery of data from D to A is complete and replication from A to D has restarted, on system D, begin and complete data recovery from D to B.
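The one-recovery-at-a-time rule can be modeled with a small tracker. This is a sketch for planning purposes only; the class and method names are hypothetical, and nothing here talks to an HCP system.

```python
class ReplicaRecoveryTracker:
    """Track links on a many-to-one replica: restore any time, recover one at a time."""

    def __init__(self):
        self.restored = set()
        self.active = None      # link currently in data recovery, if any
        self.completed = []

    def restore(self, link: str) -> None:
        # Restoring a link is allowed regardless of recoveries in progress.
        self.restored.add(link)

    def begin_recovery(self, link: str) -> None:
        if self.active is not None:
            raise RuntimeError(f"data recovery on {self.active} must finish first")
        self.active = link

    def finish_recovery(self) -> None:
        self.completed.append(self.active)
        self.active = None


tracker = ReplicaRecoveryTracker()
tracker.restore("AD")
tracker.begin_recovery("AD")
tracker.restore("BD")          # allowed while AD is still recovering
tracker.finish_recovery()
tracker.begin_recovery("BD")   # allowed only now that AD is done
print(tracker.completed, tracker.active)
```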


Failover and failback in an active/passive chained topology


In an active/passive chained replication topology, one HCP system replicates to a second HCP system, which replicates to a third HCP system. For an explanation of this topology, see Active/passive chained replication.

The way you manage failover and failback in an active/passive chained replication topology depends on which system or systems have failed. The links from the first system in the chain to the second system and from the second system to the third system function independently of each other, but order matters when you fail them over or restore and perform data recovery on them.

This section of the Help outlines the steps you need to take to return an active/passive replication chain to normal operation after the failure of any one or two of the systems in the chain. These sections assume a replication topology in which, when all three systems are healthy:

System A replicates to system B on link AB.

Link AB includes HCP tenant T1 and default-namespace directory D1, both of which were originally created on system A.

System B replicates to system C on link BC.

Link BC includes link AB and HCP tenant T2, which was originally created on system B.

T1 and D1 are read-write on system A and read-only on systems B and C.

T2 is read-write on system B and read-only on system C.

The figure below shows this topology.

[Figure: active/passive chained replication topology with systems A, B, and C]
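Because link BC includes link AB, the items carried on BC are the union of its own items and everything on its inbound link. A minimal sketch of that resolution follows; the dictionary layout is an assumption chosen for illustration, not how HCP stores link configuration.

```python
# Illustrative description of the chained topology defined above.
LINKS = {
    "AB": {"source": "A", "target": "B", "items": {"T1", "D1"}, "inbound": []},
    "BC": {"source": "B", "target": "C", "items": {"T2"}, "inbound": ["AB"]},
}


def effective_items(link_name: str, links: dict = LINKS) -> set:
    """Return every tenant and directory replicated on a link, following inbound links."""
    link = links[link_name]
    items = set(link["items"])
    for inbound in link["inbound"]:
        items |= effective_items(inbound, links)
    return items


print(sorted(effective_items("AB")))
print(sorted(effective_items("BC")))
```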


Scenario: System A fails


To return to normal operation after system A fails:

1. On system B, fail link AB over to B.

T1 and D1 become read-write on B.

2. When system A becomes available again, on system B, restore link AB.

3. On system A, accept the restored link.

4. On system B, begin and complete data recovery on link AB.

When data recovery is complete, T1 and D1 become read-write on A and read-only on B, and replication resumes on link AB.


Scenario: System B fails


To return to normal operation after system B fails:

1. On system C, fail link BC over to C.

T1 and D1 remain read-write on A and read-only on C. T2 becomes read-write on C.

2. When system B becomes available again, on system A, restore link AB.

3. On system B, accept the restored link.

T1 and D1 are read-write on A and read-only on B and C, and replication restarts on link AB.

4. On system C, restore link BC.

5. On system B, accept the restored link.

6. On system C, begin and complete data recovery on link BC.

When data recovery is complete, T1 and D1 are read-only on B and C. T2 becomes read-write on B and read-only on C, and replication resumes on link BC.


Scenario: System C fails


To return to normal operation after system C fails:

1. When system C becomes available again, on system B, restore link BC.

2. On system C, accept the restored link.

T1 and D1 remain read-only on B and C. T2 remains read-write on B and read-only on C, and replication restarts on link BC.


Scenario: Systems A and B fail


To return to normal operation after systems A and B fail:

1. On system C, fail link BC over to C.

T1 and D1 remain read-only on C. T2 becomes read-write on C.

2. When system B becomes available again, on system C, restore link BC.

3. On system B, accept the restored link.

4. On system C, begin and complete data recovery on link BC.

When data recovery is complete, T1 and D1 are read-only on B and C. T2 becomes read-write on B and read-only on C, and replication resumes on link BC.

System B automatically recreates link AB without the primary system IP addresses or hostname from the inbound link AB in link BC.

5. When system A becomes available again, take either of these actions:

- If link AB still exists on A:

1. On system A, restore link AB.

2. On system B, accept the restored link.

3. On system B, begin and complete data recovery on link AB.

When data recovery is complete, T1 and D1 become read-write on A and remain read-only on B, and replication resumes on link AB.

- If link AB no longer exists on A, follow the steps for option one or option two below.

Option one:

1. On system B, update the configuration of the automatically recreated link AB to include the IP addresses or hostname for system A.

2. On system B, restore link AB.

3. On system A, accept the restored link.

4. On system B, begin and complete data recovery on link AB.

When data recovery is complete, T1 and D1 are read-write on A and read-only on B and C, and replication restarts on link AB.

Option two:

1. On system B, suspend and then delete link AB.

T1 and D1 become directly included on link BC, which still includes T2, and become read-write on B.

2. Optionally, create a new replication chain BCA:

a. Reinstall HCP on A.

b. On system C, create outbound link CA, including link BC as an inbound link.

c. On system A, accept the new link.

T1, D1, and T2 are read-write on B and read-only on C and A, and replication starts on link CA.


Scenario: Systems A and C fail


To return to normal operation after systems A and C fail:

1. On system B, fail link AB over to B.

T1 and D1 become read-write on B. T2 remains read-write on B.

2. When system C becomes available again, on system B, restore link BC.

3. On system C, accept the restored link.

T1, D1, and T2 are read-write on B and read-only on C, and replication restarts on link BC.

4. When system A becomes available again, on system B, restore link AB.

5. On system A, accept the restored link.

6. On system B, begin and complete data recovery on link AB.

When data recovery is complete, T1 and D1 become read-write on A and read-only on B, and replication resumes on link AB.


Scenario: Systems B and C fail


To return to normal operation after systems B and C fail:

1. When system B becomes available again, on system A, restore link AB.

2. On system B, accept the restored link.

T1 and D1 are read-write on A and read-only on B, and replication restarts.

3. When system C becomes available again, if T1 and D1 still exist on system C, delete them. (This requires that all objects in the namespaces owned by T1 and in D1 be deleted first.)

Note: If T2 still exists on systems B and C, delete it from C. If T2 still exists on system C and not on system B, you can recover it to B by creating a link from C to B. However, you cannot then replicate T2 from B to C unless you first delete it from C.

4. On system B, create outbound link BC, including link AB as an inbound link. If T2 still exists on B, also include T2 in the link.

5. On system C, accept the new link.

T1 and D1 are read-only on B and C. If T2 is on link BC, T2 is read-write on B and read-only on C. Replication restarts on link BC.


Failover and failback in an active/passive one-to-many topology


In an active/passive one-to-many replication topology, one HCP system replicates to two or more other systems. For an explanation of this topology, see Active/passive one-to-many replication.

If one of the replicas fails, you follow the normal pattern for recovering from a replica failure. If more than one replica fails, you follow the normal recovery pattern for each replication link individually. The order in which you perform the recovery procedures doesn’t matter.

Throughout these failure scenarios, the HCP tenants and namespaces and default-namespace directories on each link remain read-write on the primary system and read-only on the replicas. Therefore, even if the links include the same items, no conflicts can occur.

However, if two or more links include the same HCP tenants and namespaces and default-namespace directories and the primary system fails, these items can be read-write on multiple systems at the same time. This can lead to conflicts during data recovery. For information on how HCP handles conflicts that occur during data recovery, see Replication collisions.

For example, assume a replication topology in which, when all three systems are healthy:

System A replicates to system B on link AB. Link AB includes HCP tenant T1.

System A replicates to system C on link AC. Link AC also includes HCP tenant T1.

To return to normal operation after system A fails:

1. On system B, fail link AB over to B.

T1 becomes read-write on B and read-only on A and remains read-only on C.

2. On system C, fail link AC over to C.

T1 becomes read-write on C and remains read-only on A and read-write on B. T1 is now read-write on two systems.

Tip: To prevent recovery conflicts, ensure that clients write to only system B or only system C while both systems are read-write.

3. When system A becomes available again, on system B, restore link AB. (You could restore link AC first. The order in which you restore the links doesn’t matter.)

4. On system A, accept the restored link.

5. On system B, begin and complete data recovery on link AB.

When data recovery is complete, T1 remains read-only on A because link AC is still failed over to C. It becomes read-only on B and remains read-write on C. Replication resumes on link AB.

6. When data recovery on link AB is complete, on system C, restore link AC.

7. On system A, accept the restored link.

8. On system C, begin and complete data recovery on link AC.

T1 becomes read-write on A and read-only on C and remains read-only on B. Replication resumes on link AC.
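To see where the conflict risk arises in this example, the access state of T1 after each state-changing step can be written down and checked. The snapshots below are copied from the steps above (the healthy starting state follows from T1 being read-write on the primary system and read-only on the replicas); the helper names are hypothetical.

```python
# Access state of tenant T1 on each system after the state-changing steps.
TIMELINE = [
    ("healthy", {"A": "read-write", "B": "read-only", "C": "read-only"}),
    ("step 1: fail link AB over to B", {"A": "read-only", "B": "read-write", "C": "read-only"}),
    ("step 2: fail link AC over to C", {"A": "read-only", "B": "read-write", "C": "read-write"}),
    ("step 5: recovery on link AB complete", {"A": "read-only", "B": "read-only", "C": "read-write"}),
    ("step 8: recovery on link AC complete", {"A": "read-write", "B": "read-only", "C": "read-only"}),
]


def writable(state: dict) -> list:
    """Return the systems on which T1 is read-write."""
    return sorted(name for name, access in state.items() if access == "read-write")


def conflict_snapshots(timeline: list) -> list:
    """Return the labels of snapshots in which more than one system is read-write."""
    return [label for label, state in timeline if len(writable(state)) > 1]


for label, state in TIMELINE:
    print(f"{label}: read-write on {writable(state)}")
print("conflict risk at:", conflict_snapshots(TIMELINE))
```

In this model the only snapshot with two writable systems is the one after step 2, and that state persists until data recovery on link AB completes in step 5.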


Failover and failback in an active/passive many-to-one topology with disaster recovery support


When HCP systems fail in an active/passive many-to-one topology with disaster recovery support, you need to combine the failback patterns for the many-to-one and chained topologies. For an explanation of the active/passive many-to-one topology with disaster recovery support, see Many-to-one replication with disaster recovery support.

For example, assume a replication topology in which, when all five systems are healthy:

System A replicates to system D on link AD. Link AD includes HCP tenant T1, which was originally created on system A.

System B replicates to system D on link BD. Link BD includes HCP tenant T2, which was originally created on system B.

System C replicates to system D on link CD. Link CD includes HCP tenant T3, which was originally created on system C.

System D replicates to system E on link DE. Link DE includes links AD, BD, and CD and tenant T4, which was originally created on system D.

T1, T2, and T3 are read-write on systems A, B, and C, respectively, and read-only on systems D and E.

T4 is read-write on system D and read-only on system E.

The figure below shows this topology.

[Figure: active/passive many-to-one topology with disaster recovery support, with systems A, B, C, D, and E]

To return to normal operation after systems A, B, and D fail:

1. On system E, fail link DE over to E.

T1, T2, and T3 are read-only on E. T4 becomes read-write on E. T3 remains read-write on C.

2. When system D becomes available again, on system E, restore link DE.

3. On system D, accept the restored link.

4. On system E, begin and complete data recovery on link DE.

When data recovery is complete, T1, T2, and T3 are read-only on D and E. T4 becomes read-write on D and read-only on E, and replication resumes on link DE.

5. On system C, restore link CD.

6. On system D, accept the restored link.

T3 is read-write on C and read-only on D, and replication restarts on link CD.

7. When system A becomes available again, on system D, update the configuration of the automatically recreated link AD.

8. On system D, restore link AD.

9. On system A, accept the restored link.

10. On system D, begin and complete data recovery on link AD.

When data recovery is complete, T1 becomes read-write on A and remains read-only on D. Replication resumes on link AD.

11. When system B becomes available again and data recovery on link AD is complete, on system D, update the configuration of the automatically recreated link BD.

12. On system D, restore link BD.

13. On system B, accept the restored link.

14. On system D, begin and complete data recovery on link BD.

When data recovery is complete, T2 becomes read-write on B and remains read-only on D. Replication resumes on link BD.
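The ordering in this example can be captured as a dependency graph and checked with a topological sort. In the sketch below, the task names are shorthand for groups of steps above. The dependency of link BD on link AD comes from step 11; the other dependencies are assumptions read from the order in which the steps are listed.

```python
from graphlib import TopologicalSorter

# Each task maps to the tasks that must finish before it can start.
DEPENDS_ON = {
    "fail over DE": set(),
    "recover DE": {"fail over DE"},           # steps 2 to 4
    "restore CD": {"recover DE"},             # steps 5 and 6 (assumed to follow DE)
    "recover AD": {"recover DE"},             # steps 7 to 10 (assumed to follow DE)
    "recover BD": {"recover DE", "recover AD"},  # steps 11 to 14
}

# static_order() yields every task after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

Any order the sorter produces respects the constraints, which reflects that restoring link CD and recovering link AD are not ordered relative to each other in this model.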
