Hitachi Vantara Knowledge

Recovering from a failure

During a catastrophic failure, an HCP system can lose all configuration information. In this case, if the system participates in a replication link, you need to restore the link configuration after the system is rebuilt. If the failure did not cause the system to lose the link configuration, you don’t need to restore the link.

Once the link configuration exists on both systems involved in the link:

  • For an active/active link, you need to perform the failback procedure.
  • For an active/passive link:
    • If the primary system failed, you need to recover namespace content and other applicable information from the replica.
      Note: After you fail over an active/passive link, the only way to return to normal replication is by going through the data recovery procedure. You need to perform this procedure even if you don’t need to restore the link and even if no changes have been made to the configuration or content of the replicated items. Even when nothing has changed, the data recovery process can take more than five minutes.
    • If the replica failed, replication automatically restarts, beginning again with the objects with the oldest metadata changes either across all namespaces or within each namespace, depending on the link configuration.
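
The decision logic above can be summarized in a short sketch. This is an illustrative model only, not an HCP API: the names `LinkType`, `FailedSystem`, and `recovery_steps` are hypothetical and exist only to make the branching explicit.

```python
from enum import Enum


class LinkType(Enum):
    ACTIVE_ACTIVE = "active/active"
    ACTIVE_PASSIVE = "active/passive"


class FailedSystem(Enum):
    PRIMARY = "primary"
    REPLICA = "replica"


def recovery_steps(link_type, link_config_lost, failed_system=None):
    """Return the ordered recovery actions for a failed system (hypothetical helper).

    link_config_lost: True if the rebuilt system no longer has the link configuration.
    failed_system: required for active/passive links, ignored for active/active links.
    """
    steps = []
    # The link is restored only when the failure wiped out the link configuration.
    if link_config_lost:
        steps.append("restore link")

    if link_type is LinkType.ACTIVE_ACTIVE:
        steps.append("fail back")
    elif failed_system is FailedSystem.PRIMARY:
        steps.append("recover data from replica")
    elif failed_system is FailedSystem.REPLICA:
        # No manual step: replication restarts automatically.
        steps.append("none (replication restarts automatically)")
    else:
        raise ValueError("failed_system is required for an active/passive link")
    return steps
```

For example, `recovery_steps(LinkType.ACTIVE_PASSIVE, True, FailedSystem.PRIMARY)` lists the link restore first and the data recovery second, which matches the order of the procedures in this article.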

Restoring a link

You do not need to restore a replication link if both systems involved in the link have the correct link configuration. In this scenario, you can skip this portion of the recovery procedure even though the Restore Link button is enabled.

Note: Clicking Restore Link causes the pending recovery to attempt to replicate all object data over the restored link. If the failover and recovery procedures are part of a failover and recovery test, clicking this button might make the recovery procedure take much longer than necessary.

Before you begin

Before you can restore a replication link, both systems involved in the link must have the required SSL certificates installed.

Procedure

  1. In the top-level menu of the HCP System Management Console for the system where the link configuration still exists, select Services > Replication.

  2. On the replication Links page, click the link you want to restore.

  3. On the Replication Link Details page, click Link.

  4. If necessary, update the link configuration.

    You must complete this step only if the domain name or IP addresses of the other system have changed. If the domain name of the other system has not changed, do not replace the displayed IP addresses with the domain name.
    Note: After the failure of a system that is part of an erasure coding topology, if the replacement system has a different domain name or different IP addresses from the failed system, you need to reconfigure all the replication links that connect other systems in the topology to the failed system. When reconfiguring the links, ensure that they all connect to the same replacement or rebuilt system; otherwise, the erasure coding topology will no longer function.
  5. In the replication Link panel, click the Failover tab.

  6. In the Failover pane, click Restore Link.

    Note: If you do not see the Restore Link button, wait a few minutes and then refresh the Replication page. Before displaying this button, HCP must recognize that the other system is available, has the necessary SSL certificates installed, and is unaware of the existing link.
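
The three conditions in the note on step 6 can be read as a single predicate. The sketch below is illustrative only: the function and parameter names are hypothetical, and HCP evaluates these conditions internally rather than exposing them in this form.

```python
def restore_link_button_visible(other_system_available,
                                ssl_certificates_installed,
                                other_system_knows_link):
    """Model of when the Restore Link button appears (hypothetical helper).

    All three conditions from the note must hold: the other system is
    reachable, it has the required SSL certificates, and it is unaware
    of the existing link.
    """
    return (other_system_available
            and ssl_certificates_installed
            and not other_system_knows_link)
```

If any one condition fails, for example because the certificates have not yet been installed on the rebuilt system, the predicate is false, which corresponds to the button not being displayed.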

Next steps

After you restore a link, the Restore Link button remains active. This state enables you to repeat the restore procedure if the system that failed needs to be rebuilt again before you fail back or begin recovery on the link. Although the button is still active, you can proceed to the next part of the recovery procedure.

Failing back an active/active link

  1. In the top-level menu of the HCP System Management Console for the system you failed over to, select Services > Replication.

  2. On the replication Links page, click the link you want to fail back.

  3. On the replication link details page, click Link.

  4. In the replication Link panel, click the Failover tab.

  5. In the link Failover panel, click Fail Back.

Recovering the data after a primary system failure

  1. In the top-level menu of the HCP System Management Console for the replica, select Services > Replication.

  2. On the replication Links page, click the link on which you want to recover data.

  3. On the replication link details page, click Link.

  4. In the replication Link panel, click the Failover tab.

  5. In the link Failover panel, click Begin Recovery.

    Note: After uploading new trusted replication server certificates, you may need to wait more than ten minutes for the Begin Recovery button to become active.

    The applicable HCP tenants and default-namespace directories become read-only on the primary system, and the Replication service starts copying the applicable objects and configuration information from the replica to the primary system. As with replication from the primary system to the replica, the service starts with the objects with the oldest metadata changes either across all namespaces or within each namespace, depending on the link configuration.

    Note: If the primary system cannot communicate with Active Directory and either of these is true for a tenant, recovery of that tenant is automatically paused:
    • The tenant to be recovered supports AD authentication.
    • A namespace owned by the tenant supports AD single sign-on.

    When communication between the primary system and AD is restored, you can resume recovery of the tenant.

  6. Monitor the recovery process by periodically reviewing the information in the status Overview and status Tenants panels for the link.

  7. When data recovery is almost synchronized with current tenant and namespace activities on the replica, return to the Failover panel for the link.

    Synchronization is nearing completion when the up-to-date-as-of time for the link is close to zero.
    Note: As long as clients continue writing to the replica, synchronization won’t reach one hundred percent. Synchronization doesn’t need to be completely up to date for you to start the complete recovery phase.
  8. In the link Failover panel, click Complete Recovery.

    The applicable tenants and directories on the replica immediately become read-only. The tenants and directories on both systems then remain read-only until the Replication service finishes the data recovery. The amount of time this takes depends on how much data is left to recover.

    When recovery is complete, the tenants and directories on the primary system become read-write, those on the replica remain read-only, and the Replication service on the primary system starts copying objects to the replica again.

    Tip
    • You can schedule completion of the data recovery process for a time when client usage of the repository is low.
    • If, before final recovery is complete, you need to allow clients to write to the applicable tenants and directories on the replica again, click Cancel Final Recovery in the link Failover panel. The recovery process continues, but the applicable tenants and directories become read-write on the replica and remain read-only on the primary system until you click Complete Recovery again.
  9. If DNS failover is disabled:

    1. Wait for this message to appear in the system log:

      Replication data recovery completed
    2. Tell the applicable tenant administrators to redirect all client access requests to the primary system.
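
The read-only and read-write transitions described in steps 5 through 9 can be modeled as a small state machine. This is a sketch under stated assumptions, not an HCP interface: `Phase`, `ACCESS`, `TRANSITIONS`, `next_phase`, and `recovery_completed` are hypothetical names, the read-write state of the replica during the first phase is inferred from the note that clients continue writing to it, and the log check assumes the system log is available as plain text lines.

```python
from enum import Enum


class Phase(Enum):
    RECOVERING = "recovering"   # after Begin Recovery
    FINALIZING = "finalizing"   # after Complete Recovery
    COMPLETE = "complete"       # after the Replication service finishes


# Access mode of the applicable tenants and directories in each phase.
ACCESS = {
    Phase.RECOVERING: {"primary": "read-only", "replica": "read-write"},
    Phase.FINALIZING: {"primary": "read-only", "replica": "read-only"},
    Phase.COMPLETE: {"primary": "read-write", "replica": "read-only"},
}

# Console actions (and the service finishing) that move between phases.
TRANSITIONS = {
    (Phase.RECOVERING, "Complete Recovery"): Phase.FINALIZING,
    (Phase.FINALIZING, "Cancel Final Recovery"): Phase.RECOVERING,
    (Phase.FINALIZING, "recovery finished"): Phase.COMPLETE,
}

# Message named in step 9 of the procedure.
COMPLETION_MESSAGE = "Replication data recovery completed"


def next_phase(phase, event):
    """Return the phase that follows `event`, or raise if it is not allowed."""
    key = (phase, event)
    if key not in TRANSITIONS:
        raise ValueError(f"{event!r} is not valid in phase {phase.value}")
    return TRANSITIONS[key]


def recovery_completed(log_lines):
    """Return True if any log line contains the completion message."""
    return any(COMPLETION_MESSAGE in line for line in log_lines)
```

The table makes one operational point easy to see: between clicking Complete Recovery and the end of recovery, neither system accepts writes, which is why the tip suggests scheduling that step for a period of low client usage.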

 
