Nutanix Disaster Recovery Best Practices

This guide provides recommended best practices for configuring disaster recovery (DR) replication in Nutanix environments. It is designed to help you optimize your data protection strategy using Nutanix's native DR capabilities found in Prism Central.

Nutanix Guest Tools (NGT)

NGT enables application-consistent snapshots, beyond the default crash-consistent snapshots, and facilitates seamless IP management during failover events.

NGT Installation
Installing NGT up front is strongly recommended: it avoids the need for later installation across many servers and mitigates networking issues during failover.

Availability Zones

Paired availability zones synchronize protection policies, recovery plans, and the categories used in those policies and plans.

Always create protection policies, recovery plans, and categories on the primary cluster.

Primary = Replication Source
Secondary = Replication Target

RPO Setting

We recommend Async with a 1-hour recovery point objective (RPO).
NearSync may also be configured with an RPO as low as 15 minutes.
Synchronous replication is not supported due to extremely stringent latency requirements.

Storage Containers

Storage Container Naming
The storage containers between primary and recovery clusters must have the same name. Mismatched names could lead to replication issues if data reduction features are not enabled on the selected containers.

For VMs that haven't been replicated before, create a container with the same name on both sides. If the VM was in a protection domain before, it continues to use the same container.
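As a rough illustration of the naming rule above, the following sketch compares two lists of container names and reports any that exist on the primary cluster but not on the recovery cluster. The function name and the sample container names are hypothetical, and how you obtain the name lists (Prism export, API, manual entry) is left open; the sketch also assumes an exact, case-sensitive match.

```python
def missing_containers(primary_names, recovery_names):
    """Return names found on the primary cluster but absent on the recovery cluster."""
    # Assumption: names must match exactly, including case.
    return sorted(set(primary_names) - set(recovery_names))


# Hypothetical container names, for illustration only.
primary = ["prod-ctr", "sql-ctr", "default-container"]
recovery = ["prod-ctr", "default-container"]

print(missing_containers(primary, recovery))  # ['sql-ctr']
```

An empty result suggests every primary container has a same-named counterpart on the recovery side.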

Protection Policies

Apply protection policies using categories, and follow these guidelines:

If using NearSync replication, limit your policies to fewer than 200 VMs.
We recommend utilizing "Roll-up" retention types for efficient data protection.
- Roll-Up Retention combines the oldest snapshots for the specified RPO interval into a single snapshot when the next higher interval is reached, all the way up to the retention period specified for a site.
- Linear Retention implements a retention scheme at both sites (local and remote). If you set the retention number for a given site to ‘n,’ the site retains the ‘n’ most recent snapshots.
  - Example: If the RPO is 1 hour and the retention number for the local site is 48, the 48 most recent snapshots are retained at any given time.
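The arithmetic behind the linear retention example can be sketched as follows. This is a simplified estimate, assuming snapshots are taken exactly once per RPO interval; the function name is hypothetical and is not part of any Nutanix tooling.

```python
from datetime import timedelta


def linear_retention_window(rpo, retention_count):
    """Approximate time span covered by the 'n' most recent snapshots."""
    # Assumption: one snapshot per RPO interval, with no skipped or extra snapshots.
    return rpo * retention_count


window = linear_retention_window(timedelta(hours=1), 48)
print(window)  # 2 days, 0:00:00
```

With a 1-hour RPO and a retention number of 48, the retained snapshots span roughly the last two days.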

Creating Categories

Prism Central includes built-in categories for frequently encountered applications (for example, MS Exchange and Oracle). If the category or value you want is not available, first create the category with the required values, or update an existing category so that it has the values you require.

You can add entities to the category either before or after you configure the protection policy. If the entities have a common characteristic, such as belonging to a specific application or location, create a category and add the entities into the category.

Recovery Plans

Create your recovery plans based on the following best practices. Review the DR configuration maximums chart below as well.

Always configure Async and NearSync replicated VMs on separate plans.
Implement non-routable networks for testing to prevent interference with production environments.
- If Expedient is the recovery target for DR, the test (bubble) network can be created in the VPC using a non-routable network/subnet such as APIPA (169.254.0.0/16).
Use different recovery plans for asynchronous and synchronous replication VMs, as synchronous recovery does not support planned failover.
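One quick way to sanity-check a candidate test (bubble) subnet is to confirm that it falls inside the link-local (APIPA) range. The sketch below uses Python's standard ipaddress module; the function name and sample subnets are hypothetical, and the check covers only the APIPA case mentioned above, not other isolation methods.

```python
import ipaddress


def is_link_local_subnet(cidr):
    """Return True if the whole subnet lies inside 169.254.0.0/16 (IPv4 link-local)."""
    # strict=True raises ValueError if the CIDR has host bits set.
    network = ipaddress.ip_network(cidr, strict=True)
    return network.is_link_local


print(is_link_local_subnet("169.254.10.0/24"))  # True
print(is_link_local_subnet("10.20.30.0/24"))    # False
```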

Always Validate
Run the Validate workflow after making changes and use the Test and Clean-Up workflows to manage VM states during trials.
After you run the Test workflow, run the Clean-Up workflow instead of manually deleting VMs.

Recovery Plan Limitations / Recommendations

We recommend keeping your totals under the limits listed below:

Max Recovery Plans executing at once:
- With Prism Central Small, Medium, or Single Node Large: 5
- With Prism Central Large Scale-Out, XL Scale-Out: 10
  - XL Scale-Out is the current Expedient standard
VMs restoring simultaneously, across all active recovery plan jobs:
- With Prism Central Small, Medium, or Single Node Large: 10
- With Prism Central Large Scale-Out, XL Scale-Out: 200
  - XL Scale-Out is the current Expedient standard
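The limits above can be turned into a simple pre-flight check before launching several recovery plans at once. This is a minimal sketch: the dictionary keys, function name, and sample VM counts are hypothetical, while the numeric limits are copied from the list above.

```python
# Limits copied from the list above; the key names are illustrative.
LIMITS = {
    "single_node": {"max_plans": 5, "max_vms": 10},    # Small, Medium, Single Node Large
    "scale_out": {"max_plans": 10, "max_vms": 200},    # Large Scale-Out, XL Scale-Out
}


def fits_limits(vms_per_plan, pc_type="scale_out"):
    """Check both the plan count and the combined VM count against the limits."""
    limit = LIMITS[pc_type]
    return (
        len(vms_per_plan) <= limit["max_plans"]
        and sum(vms_per_plan) <= limit["max_vms"]
    )


print(fits_limits([50, 60, 40]))  # True: 3 plans, 150 VMs
print(fits_limits([120, 100]))    # False: 220 VMs exceeds 200
```

If the check fails, one option is to run the plans in smaller groups so that each group stays under both limits.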

Backups

Third-Party Backups

If using backup solutions like Cohesity or Commvault, disable backups before a DR failover. This preemptive step prevents snapshot-induced data corruption during the transition.

Backup Continuity

In the event of a failover where the primary site is offline or no longer available, we recommend creating a new protection policy that takes local snapshots (recovery points) of the production VMs on the recovery cluster.

Note that additional snapshots will consume storage space on the local cluster.

Recovery Plans in Action

What happens during the restore phase of each recovery plan?

A Test Failover uses metadata to clone the latest snapshot and create a VM from it.
A Planned Failover powers off the source, takes a new snapshot, replicates it to the target, then clones that snapshot and creates a VM from it.
- ⚠️ This is the most time-consuming option: the new snapshot adds time that a Test Failover or Unplanned Failover does not incur.
An Unplanned Failover, similar to a Test Failover, uses metadata to clone the latest snapshot and create a VM from it.
- 🚨 It is NOT recommended to perform an Unplanned Failover unless you are in a true DR scenario. Unplanned failovers require manual cleanup on the source side.
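The difference between the three failover types can be summarized as ordered step lists. The sketch below is only a model of the descriptions above, not Nutanix code; the enum, function name, and step wording are hypothetical.

```python
from enum import Enum


class Failover(Enum):
    TEST = "test"
    PLANNED = "planned"
    UNPLANNED = "unplanned"


def restore_steps(kind):
    """Return the restore-phase steps for a failover type, as described above."""
    if kind is Failover.PLANNED:
        return [
            "power off source VM",
            "take new snapshot",
            "replicate snapshot to target",
            "clone the new snapshot",
            "create VM from clone",
        ]
    # Test and Unplanned failovers both start from the latest existing snapshot.
    return ["clone latest snapshot using metadata", "create VM from clone"]


for kind in Failover:
    print(kind.value, len(restore_steps(kind)))
```

The longer step list for a Planned Failover reflects why it takes the most time of the three.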

Nutanix DR Configuration Maximums

Name	Limit	Description
AZ: Guest VMs/VGs	2000	Total number of Guest VMs/VGs per AZ between any combination of PCs
Category: Guest VMs/VGs	1000	Number of Guest VMs/VGs per category
Consistency Group: Guest VMs/VGs	30	Number of Guest VMs/VGs per consistency group
Consistency Group: Number of virtual disks in one CG	1000	Number of virtual disks per consistency group
ProtectionPolicy: Categories	60	Number of Guest VM/VGs categories per protection policy
ProtectionPolicy: Guest VMs/VGs	2000	Number of Guest VMs/VGs per protection policy.
RecoveryPlan: Categories	60	Number of VM/VG categories per recovery plan.
RecoveryPlan: Categories in a stage	60	Number of VM/VG categories per stage per recovery plan.
RecoveryPlan: Failover of Guest VMs/VGs in Parallel	1000	Number of Guest VMs/VGs that can failover in parallel across all recovery plans
RecoveryPlan: Guest VMs/VGs by Category	1000	Number of Guest VMs/VGs added by VM/VG category per recovery plan.
RecoveryPlan: Guest VMs/VGs by Name	1000	Number of Guest VMs/VGs added by VM/VG name per recovery plan.
RecoveryPlan: Parallel Jobs	10	Number of recovery plans that can run in parallel. The number of VMs/VGs in all the running recovery plans must not exceed the maximum limit.
RecoveryPlan: Stages	60	Number of Stages per recovery plan. (applicable for VM only)
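A few of the recovery plan maximums from the table can be encoded as data and checked against a planned design. This is a minimal sketch: the key names, function name, and sample plan are hypothetical, and only a subset of the table's limits is included.

```python
# Subset of the RecoveryPlan rows from the table above; key names are illustrative.
MAXIMUMS = {
    "categories": 60,
    "stages": 60,
    "vms_by_category": 1000,
    "vms_by_name": 1000,
}


def limit_violations(plan):
    """Return one message per value in `plan` that exceeds its maximum."""
    return [
        f"{key}: {plan[key]} exceeds {limit}"
        for key, limit in MAXIMUMS.items()
        if plan.get(key, 0) > limit
    ]


# Hypothetical recovery plan design.
plan = {"categories": 12, "stages": 4, "vms_by_category": 1200}
print(limit_violations(plan))  # ['vms_by_category: 1200 exceeds 1000']
```

An empty list means the design stays within the limits that were checked; limits not listed in the dictionary still need to be reviewed against the table.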