🧾 Azure BCDR – How I Turned a DR Review into a Strategic Recovery Plan
In Part 1 of this series, I shared how we reviewed our Azure BCDR posture after inheriting a partially implemented cloud estate. The findings were clear: while the right tools were in place, the operational side of disaster recovery hadn’t been addressed.
There were no test failovers, no documented Recovery Plans, no automation, and several blind spots in DNS, storage, and private access.
This post outlines how I took that review and turned it into a practical recovery strategy — one that we could share internally, align with our CTO, and use as a foundation for further work with our support partner.
To provide context, our estate is deployed primarily in the UK South Azure region, with UK West serving as the designated DR target region.
It’s not a template — it’s a repeatable, real-world approach to structuring a BCDR plan when you’re starting from inherited infrastructure, not a clean slate.
🧭 1. Why Documenting the Plan Matters
Most cloud teams can identify issues. Fewer take the time to formalise the findings in a way that supports action and alignment.
Documenting our BCDR posture gave us three things:
- 🧠 Clarity — a shared understanding of what’s protected and what isn’t
- 🔦 Visibility — a way to surface risk and prioritise fixes
- 🎯 Direction — a set of realistic, cost-aware next steps
We weren’t trying to solve every problem at once. The goal was to define a usable plan we could act on, iterate, and eventually test — all while making sure that effort was focused on the right areas.
🧱 2. Starting the Document
I structured the document to speak to both technical stakeholders and senior leadership. It needed to balance operational context with strategic risk.
✍️ Core Sections
- Executive Summary – what the document is, why it matters
- Maturity Snapshot – a simple traffic-light view of current vs target posture
- Workload Overview – what’s in scope and what’s protected
- Recovery Objectives – realistic RPO/RTO targets by tier
- Gaps and Risks – the areas most likely to cause DR failure
- Recommendations – prioritised, actionable, and cost-aware
- Next Steps – what we can handle internally, and what goes to the MSP
Each section followed the same principle: clear, honest, and focused on action. No fluff, no overstatements — just a straightforward review of where we stood and what needed doing.
🧩 3. Defining the Current State
Before we could plan improvements, we had to document what actually existed. This wasn’t about assumptions — it was about capturing the real configuration and coverage in Azure.
🗂️ Workload Inventory
We started by categorising all VMs and services:
- Domain controllers
- Application servers (web/API/backend)
- SQL Managed Instances
- Infrastructure services (file, render, schedulers)
- Management and monitoring VMs
Each workload was mapped by criticality and recovery priority — not just by type.
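If you want to pull that raw VM list straight from Azure before assigning criticality by hand, a minimal sketch along these lines is a reasonable starting point. It assumes the azure-identity and azure-mgmt-compute Python SDKs and a hypothetical workload-type tag — it isn't the exact tooling we used:

```python
# Sketch: build a raw VM inventory grouped by a (hypothetical) "workload-type" tag.
# Assumes Reader access plus the azure-identity and azure-mgmt-compute packages.
from collections import defaultdict

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "<subscription-id>"  # placeholder
compute = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

inventory = defaultdict(list)
for vm in compute.virtual_machines.list_all():
    # Tag name is an assumption for illustration, not an inherited convention
    category = (vm.tags or {}).get("workload-type", "uncategorised")
    inventory[category].append((vm.name, vm.location))

for category, vms in sorted(inventory.items()):
    print(f"{category}: {len(vms)} VM(s)")
    for name, location in vms:
        print(f"  {name} ({location})")
```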
🛡️ Protection Levels
For each workload, we recorded:
- ✅ Whether it was protected by ASR
- 🔁 Whether it was backed up only
- 🚫 Whether it had no protection (with justification)
We also reviewed the geographic layout — e.g. which services were replicated into UK West, and which existed only in UK South.
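To keep that review honest, it helps to turn the notes into a simple coverage summary. The sketch below is illustrative rather than our actual tooling: the VM names and protection states are placeholders, and in practice the states were read from the Recovery Services vault (ASR replication items and backup items):

```python
# Sketch: summarise protection coverage captured during the review.
# Workload names and states are illustrative placeholders.

# workload -> (protection state, justification if unprotected)
protection_map = {
    "dc-uks-01":      ("asr", None),
    "app-web-01":     ("asr", None),
    "app-batch-01":   ("backup-only", None),
    "render-node-03": ("none", "stateless; rebuilt from image on demand"),
    "test-vm-07":     ("none", None),  # missing justification -> flagged
}

buckets = {"asr": [], "backup-only": [], "none": []}
missing_justification = []

for vm, (state, justification) in protection_map.items():
    buckets[state].append(vm)
    if state == "none" and not justification:
        missing_justification.append(vm)

for state, vms in buckets.items():
    print(f"{state}: {', '.join(vms) or '-'}")
if missing_justification:
    print("Unprotected without documented justification:", ", ".join(missing_justification))
```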
🧠 Supporting Services
Beyond the VMs, we looked at:
- Identity services (AD, domain controllers, replication health)
- DNS architecture (AD-integrated zones, private DNS zones)
- Private Endpoints and their region-specific availability
- Storage account replication types (LRS, RA-GRS, ZRS)
- Network security and routing configurations in DR
The aim wasn’t to build a full asset inventory — just to gather enough visibility to start making risk-based decisions about what mattered, and what was missing.
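As one example of that "just enough visibility" principle, storage redundancy is quick to check programmatically. A rough sketch, assuming the azure-identity and azure-mgmt-storage SDKs; the LRS-in-uksouth flag is only an illustration of the kind of rule we applied:

```python
# Sketch: list storage account redundancy so locally redundant accounts stand out.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<subscription-id>"  # placeholder
storage = StorageManagementClient(DefaultAzureCredential(), subscription_id)

for account in storage.storage_accounts.list():
    sku = account.sku.name  # e.g. Standard_LRS, Standard_GRS, Standard_RAGRS, Standard_ZRS
    # Flag anything that is LRS-only and lives in the primary region
    flag = "review" if "LRS" in sku and account.location == "uksouth" else "ok"
    print(f"{account.name:<30} {account.location:<10} {sku:<18} {flag}")
```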
⏱️ 4. Setting Recovery Objectives
Once the current state was mapped, the next step was to define what “recovery” should actually look like — in terms that could be communicated, challenged, and agreed.
We focused on two key metrics:
- RTO (Recovery Time Objective): How long can this system be offline before we see significant operational impact?
- RPO (Recovery Point Objective): How much data loss is acceptable in a worst-case failover?
These weren’t guessed or copied from a template. We worked from realistic assumptions based on our tooling, team capability, and the criticality of each service.
📊 Tiered Recovery Model
Each workload was assigned to one of four tiers:
| Tier | Description |
|---|---|
| Tier 0 | Core infrastructure (identity, DNS, routing) |
| Tier 1 | Mission-critical production workloads |
| Tier 2 | Important, but not time-sensitive services |
| SQL MI | Treated separately due to its PaaS nature |
We then applied RTO and RPO targets based on what we could achieve today vs what we aim to reach with improvements.
🔥 Heatmap Example
| Workload Tier | RPO (Current) | RTO (Current) | RPO (Optimised) | RTO (Optimised) |
|---|---|---|---|---|
| Tier 0 – Identity | 5 min | 60 min | 5 min | 30 min |
| Tier 1 – Prod | 5 min | 360 min | 5 min | 60 min |
| Tier 2 – Non-Crit | 1440 min (24 hrs) | 1440 min (24 hrs) | 60 min | 240 min |
| SQL MI | 0 min | 60 min | 0 min | 30 min |
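To make the heatmap more than a picture, it helps to encode the targets so current figures can be checked against them. This is only a sketch: the workload names and current values below are placeholders, and the targets are the optimised column from the table above:

```python
# Sketch: check placeholder current figures against the optimised tier targets (minutes).
TARGETS = {
    "tier0": {"rpo": 5, "rto": 30},
    "tier1": {"rpo": 5, "rto": 60},
    "tier2": {"rpo": 60, "rto": 240},
    "sqlmi": {"rpo": 0, "rto": 30},
}

# Current figures per workload (illustrative names and values)
current = {
    "dc-uks-01":  {"tier": "tier0", "rpo": 5, "rto": 60},
    "app-web-01": {"tier": "tier1", "rpo": 5, "rto": 360},
    "render-01":  {"tier": "tier2", "rpo": 1440, "rto": 1440},
}

for name, figures in current.items():
    target = TARGETS[figures["tier"]]
    misses = [m.upper() for m in ("rpo", "rto") if figures[m] > target[m]]
    status = "meets optimised targets" if not misses else "misses " + ", ".join(misses)
    print(f"{name:<12} ({figures['tier']}): {status}")
```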
🚧 5. Highlighting Gaps and Risks
With recovery objectives defined, the gaps became much easier to identify — and to prioritise.
We weren’t trying to protect everything equally. The goal was to focus attention on the areas that introduced the highest risk to recovery if left unresolved.
⚠️ What We Flagged
- ❌ No test failovers had ever been performed
- ❌ No Recovery Plans existed
- 🌐 Public-facing infrastructure only existed in one region
- 🔒 Private Endpoints lacked DR equivalents
- 🧭 DNS failover was manual or undefined
- 💾 Storage accounts had inconsistent replication logic
- 🚫 No capacity reservations existed for critical VM SKUs
Each gap was documented with its impact, priority, and remediation options.
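For anyone building a similar register, a minimal entry shape might look like the sketch below. The fields mirror what we captured (impact, priority, remediation options); the content shown is illustrative, not our actual register:

```python
# Sketch: a minimal gap-register entry; the example values are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class GapEntry:
    gap: str
    impact: str
    priority: str                          # e.g. "high" / "medium" / "low"
    remediation: List[str] = field(default_factory=list)


example = GapEntry(
    gap="No test failovers ever performed",
    impact="Recovery times and runbook accuracy are unproven",
    priority="high",
    remediation=[
        "Run an isolated ASR test failover for one Tier 1 group",
        "Feed measured timings back into the RTO heatmap",
    ],
)
print(f"[{example.priority.upper()}] {example.gap}: {example.impact}")
```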
🛠️ 6. Strategic Recommendations
We split our recommendations into what we could handle internally, and what would require input from our MSP or further investment.
📌 Internal Actions
- Build and test Recovery Plans for Tier 0 and Tier 1 workloads
- Improve DNS failover scripting (a rough sketch follows after this list)
- Review VM tags to reflect criticality and protection state
- Create sequencing logic for application groups
- Align NSGs and UDRs in DR with production
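For the DNS failover item above, the Azure Private DNS side can be scripted roughly as follows. This is a sketch rather than our production runbook: it assumes the azure-mgmt-privatedns SDK, the zone, record names, and IPs are placeholders, and AD-integrated zones still need their own handling on the domain controllers:

```python
# Sketch: repoint private DNS A records at the DR IPs during failover.
# Zone, record names, and IPs are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.privatedns import PrivateDnsManagementClient
from azure.mgmt.privatedns.models import ARecord, RecordSet

subscription_id = "<subscription-id>"   # placeholder
resource_group = "rg-dns"               # placeholder
zone = "corp.internal"                  # placeholder private zone

# Production record name -> IP of the equivalent service in UK West
failover_map = {"app-api": "10.20.1.10", "app-web": "10.20.1.11"}

dns = PrivateDnsManagementClient(DefaultAzureCredential(), subscription_id)
for record_name, dr_ip in failover_map.items():
    dns.record_sets.create_or_update(
        resource_group, zone, "A", record_name,
        RecordSet(ttl=300, a_records=[ARecord(ipv4_address=dr_ip)]),
    )
    print(f"{record_name}.{zone} -> {dr_ip}")
```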
🤝 MSP-Led or Partner Support
- Duplicate App Gateways / ILBs in UK West
- Implement Private DNS Zones
- Review and implement capacity reservations
- Test runbook-driven recovery automation
- Conduct structured test failovers across service groups
📅 7. Making It Actionable
A plan needs ownership and timelines. We assigned tasks by role and defined short-, medium-, and long-term priorities using a simple planning table.
We treat the BCDR document as a living artefact — updated quarterly, tied to change control, and used to guide internal work and partner collaboration.
🔚 8. Closing Reflections
The original goal wasn’t to build a perfect DR solution — it was to understand where we stood, make recovery realistic, and document a plan that would hold up when we needed it most.
We inherited a functional technical foundation — but needed to formalise and validate it as part of a resilient DR posture.
By documenting the estate, defining recovery objectives, and identifying where the real risks were, we turned a passive DR posture into something we could act on. We gave stakeholders clarity. We gave the support partner direction. And we gave ourselves a roadmap.
🔜 What’s Next
In the next part of this series, I’ll walk through how we executed the plan:
- Building and testing our first Recovery Plan
- Improving ASR coverage and validation
- Running our first failover drill
- Reviewing results and updating the heatmap
If you're stepping into an inherited cloud environment or starting your first structured DR review, I hope this gives you a practical view of what’s involved — and what’s achievable without overcomplicating the process.
Let me know if you'd like to see templates or report structures from this process in a future post.