This guide provides recommended best practices for configuring disaster recovery (DR) replication in Nutanix environments. It is designed to help you optimize your data protection strategy using Nutanix's native DR capabilities found in Prism Central.
Nutanix Guest Tools (NGT)
NGT enables application-consistent snapshots, beyond the default crash-consistent snapshots, and facilitates seamless IP management during failover events.
NGT Installation
Installing NGT up front is strongly recommended: it avoids a later installation across many servers and mitigates networking issues during failover.
Availability Zones
Paired availability zones synchronize protection policies, recovery plans, and the categories that those policies and plans use.
Always create protection policies, recovery plans, and categories on the primary cluster.
Primary = Replication Source
Secondary = Replication Target
RPO Setting
We recommend Async with a 1-hour recovery point objective (RPO).
NearSync may also be configured with an RPO as low as 15 minutes.
Synchronous replication is not supported due to extremely stringent latency requirements.
Storage Containers
Storage Container Naming
Storage containers on the primary and recovery clusters must have the same name. Mismatched names could lead to replication issues if data reduction features are not enabled on the selected containers.
For VMs that haven't been replicated before, create a container with the same name on both sides. If the VM was in a protection domain before, it continues to use the same container.
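The naming rule above can be checked before replication is enabled. The following is a minimal sketch that works on plain lists of container names; it does not call any Nutanix API. The function name, the sample inventory, and the exact string comparison are assumptions made for illustration.

```python
def find_container_mismatches(primary_containers, recovery_containers, vm_containers):
    """Return {vm_name: container_name} for every VM whose container does not
    exist under the same name on both clusters.

    Names are compared as exact strings (an assumption for this sketch).
    """
    primary = set(primary_containers)
    recovery = set(recovery_containers)
    mismatches = {}
    for vm_name, container in vm_containers.items():
        if container not in primary or container not in recovery:
            mismatches[vm_name] = container
    return mismatches


# Hypothetical inventory, gathered by whatever means you already use.
primary = ["prod-ctr", "sql-ctr"]
recovery = ["prod-ctr", "sql-ctr-dr"]
vms = {"web01": "prod-ctr", "db01": "sql-ctr"}

print(find_container_mismatches(primary, recovery, vms))
# {'db01': 'sql-ctr'}
```

Any VM reported here would need a same-named container created on the recovery cluster before it is protected.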
Protection Policies
Protection policies should be applied using categories, ensuring they meet the following guidelines:
If using NearSync replication, limit your policies to fewer than 200 VMs.
We recommend the Roll-Up retention type for efficient data protection.
Roll-Up Retention combines the oldest snapshots for the specified RPO interval into a single snapshot when the next higher interval is reached, all the way up to the retention period specified for a site.
Linear Retention implements a retention scheme at both sites (local and remote). If you set the retention number for a given site to ‘n,’ the site retains the ‘n’ most recent snapshots.
Example: If the RPO is 1 hour and the retention number for the local site is 48, the 48 most recent snapshots are retained at any given time.
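The linear retention example can be worked through in a few lines of Python. This is only a sketch of the "keep the n most recent snapshots" rule as described above, not how Prism Central implements it. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def apply_linear_retention(snapshot_times, retention_count):
    """Split snapshots into (kept, expired), keeping the newest retention_count."""
    newest_first = sorted(snapshot_times, reverse=True)
    return newest_first[:retention_count], newest_first[retention_count:]


# 60 hourly snapshots (RPO of 1 hour), starting at an arbitrary point in time.
start = datetime(2024, 1, 1, 0, 0)
snapshots = [start + timedelta(hours=h) for h in range(60)]

kept, expired = apply_linear_retention(snapshots, retention_count=48)

print(len(kept), len(expired))  # 48 12
# Hours between the newest and the oldest retained snapshot.
print((kept[0] - kept[-1]).total_seconds() / 3600)  # 47.0
```

With a 1-hour RPO and a retention number of 48, the retained snapshots therefore cover roughly the last two days.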
Creating Categories
Prism Central includes built-in categories for frequently encountered applications (for example, MS Exchange and Oracle). If the category or value you want is not available, first create the category with the required values, or update an existing category so that it has the values you require.
You can add entities to the category either before or after you configure the protection policy. If the entities have a common characteristic, such as belonging to a specific application or location, create a category and add the entities into the category.
Recovery Plans
Create your recovery plans based on the following best practices. Review the DR configuration maximums chart below as well.
Always configure Async and NearSync replicated VMs on separate plans.
Limit recovery plans to 30 VMs for faster failover, as only 10 VMs can fail over simultaneously.
Implement non-routable networks for testing to prevent interference with production environments.
If Expedient is the recovery target for DR, the test (bubble) network can be created in the VPC using a non-routable network/subnet such as the APIPA range (169.254.0.0/16).
Use different recovery plans for asynchronous and synchronous replication VMs, as synchronous recovery does not support planned failover.
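The grouping guidelines above can be sketched as a small planning helper. It separates VMs by replication type, splits each group into plans of at most 30 VMs, and checks whether a candidate test subnet falls in the link-local (APIPA) range. All names, the plan naming scheme, and the sample inventory are hypothetical; the code only organizes data and does not create anything in Prism Central.

```python
import ipaddress
from collections import defaultdict

MAX_VMS_PER_PLAN = 30  # guideline from the list above


def build_recovery_plans(vms, max_per_plan=MAX_VMS_PER_PLAN):
    """vms is an iterable of (vm_name, replication_type) pairs.

    Returns a list of (plan_name, [vm_name, ...]) where no plan mixes
    replication types or exceeds max_per_plan VMs.
    """
    by_type = defaultdict(list)
    for vm_name, replication_type in vms:
        by_type[replication_type].append(vm_name)

    plans = []
    for replication_type in sorted(by_type):
        names = by_type[replication_type]
        for i in range(0, len(names), max_per_plan):
            plan_name = f"{replication_type.lower()}-plan-{i // max_per_plan + 1}"
            plans.append((plan_name, names[i:i + max_per_plan]))
    return plans


def is_link_local_subnet(cidr):
    """True if the subnet lies inside 169.254.0.0/16 (IPv4 link-local)."""
    return ipaddress.ip_network(cidr).is_link_local


# Hypothetical inventory: 45 Async VMs and 10 NearSync VMs.
inventory = [(f"app{n:02d}", "Async") for n in range(45)]
inventory += [(f"db{n:02d}", "NearSync") for n in range(10)]

for plan_name, members in build_recovery_plans(inventory):
    print(plan_name, len(members))
# async-plan-1 30
# async-plan-2 15
# nearsync-plan-1 10

print(is_link_local_subnet("169.254.10.0/24"))  # True
```

The /24 test subnet is only an example; any subnet inside 169.254.0.0/16 passes the check.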
Always Validate
Run the Validate workflow after making changes and use the Test and Clean-Up workflows to manage VM states during trials.
After you run the Test workflow, run the Clean-Up workflow instead of manually deleting VMs.
Backups
Third-Party Backups
If using backup solutions like Cohesity or Commvault, disable backups before a DR failover. This preemptive step prevents snapshot-induced data corruption during the transition.
Backup Continuity
In the event of a failover where the primary site is offline or no longer available, create a new protection policy to take local snapshots (recovery points) of production VMs.
Note that additional snapshots will consume storage space on the local cluster.
Nutanix DR Configuration Maximums
Name | Limit | Description |
---|---|---|
AZ: Guest VMs/VGs | 2000 | Total number of Guest VMs/VGs per AZ between any combination of PCs |
Category: Guest VMs/VGs | 1000 | Number of Guest VMs/VGs per category |
Consistency Group: Guest VMs/VGs | 30 | Number of Guest VMs/VGs per consistency group |
Consistency Group: Number of virtual disks in one CG | 1000 | Number of virtual disks per consistency group |
ProtectionPolicy: Categories | 60 | Number of Guest VM/VGs categories per protection policy |
ProtectionPolicy: Guest VMs/VGs | 2000 | Number of Guest VMs/VGs per protection policy. |
RecoveryPlan: Categories | 60 | Number of VM/VG categories per recovery plan. |
RecoveryPlan: Categories in a stage | 60 | Number of VM/VG categories per stage per recovery plan. |
RecoveryPlan: Failover of Guest VMs/VGs in Parallel | 1000 | Number of Guest VMs/VGs that can fail over in parallel across all recovery plans |
RecoveryPlan: Guest VMs/VGs by Category | 1000 | Number of Guest VMs/VGs added by VM/VG category per recovery plan. |
RecoveryPlan: Guest VMs/VGs by Name | 1000 | Number of Guest VMs/VGs added by VM/VG name per recovery plan. |
RecoveryPlan: Parallel Jobs | 10 | Number of recovery plans that can run in parallel. The number of VMs/VGs in all the running recovery plans must not exceed the maximum limit. |
RecoveryPlan: Stages | 60 | Number of Stages per recovery plan. (applicable for VM only) |
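
A planned design can be compared against the maximums in the table with a simple lookup. The sketch below copies a subset of the limits from the table; the dictionary keys, the function name, and the sample design are hypothetical and are not part of any Nutanix tool.

```python
# Subset of the limits from the table above, keyed by hypothetical short names.
LIMITS = {
    "vms_per_category": 1000,
    "vms_per_consistency_group": 30,
    "categories_per_protection_policy": 60,
    "vms_per_protection_policy": 2000,
    "categories_per_recovery_plan": 60,
    "stages_per_recovery_plan": 60,
    "parallel_recovery_plans": 10,
}


def find_violations(planned, limits=LIMITS):
    """Return one message per planned value that exceeds its limit.

    Keys that are not in limits are ignored; a value equal to the limit passes.
    """
    return [
        f"{key}: planned {value} exceeds limit {limits[key]}"
        for key, value in planned.items()
        if key in limits and value > limits[key]
    ]


# Hypothetical design to check.
planned_design = {
    "vms_per_protection_policy": 2400,
    "categories_per_recovery_plan": 12,
    "vms_per_consistency_group": 30,
}

for message in find_violations(planned_design):
    print(message)
# vms_per_protection_policy: planned 2400 exceeds limit 2000
```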