
πŸ”„ Bringing Patch Management In-House: Migrating from MSP to Azure Update Manager

It's all fun and games until the MSP contract expires and you realise 90 VMs still need their patching schedules sorted…

With our MSP contract winding down, the time had come to bring VM patching back in-house. Our third-party provider had been handling it with their own tooling, which would disappear along with the service contract.

Enter Azure Update Manager β€” the modern, agentless way to manage patching schedules across your Azure VMs. Add a bit of PowerShell, sprinkle in some Azure Policy, and you've got yourself a scalable, policy-driven solution that's more visible, auditable, and way more maintainable.

Here's how I made the switch β€” and managed to avoid a patching panic.


βš™οΈ Prerequisites & Permissions

Let's get the plumbing sorted before diving in.

You'll need:

  • The right PowerShell modules:
Install-Module Az -Scope CurrentUser -Force
Import-Module Az.Maintenance, Az.Resources, Az.Compute
  • An account with Contributor permissions (or higher)
  • Registered providers to avoid mysterious error messages:
Register-AzResourceProvider -ProviderNamespace Microsoft.Maintenance
Register-AzResourceProvider -ProviderNamespace Microsoft.GuestConfiguration

Why Resource Providers? Azure Update Manager needs these registered to create the necessary API endpoints and resource types in your subscription. Without them, you'll get cryptic "resource type not found" errors.
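
Registration isn't instant, so it's worth confirming both providers show as Registered before you move on. A quick check:

# Confirm registration state for both providers (re-run after a few minutes if still "Registering")
Get-AzResourceProvider -ListAvailable |
    Where-Object { $_.ProviderNamespace -in @("Microsoft.Maintenance", "Microsoft.GuestConfiguration") } |
    Select-Object ProviderNamespace, RegistrationState -Unique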

Official documentation on Azure Update Manager prerequisites


πŸ•΅οΈ Step 1 – Audit the Current Setup

First order of business: collect the patching summary data from the MSP β€” which, helpfully, came in the form of multiple weekly CSV exports.

I used GenAI to wrangle the mess into a structured format. The result was a clear categorisation of VMs based on the day and time they were typically patched β€” a solid foundation to work from.
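
If you'd rather keep the wrangling in PowerShell, the same idea looks roughly like this. It's only a sketch: the column names (VMName, PatchInstalledTime) are hypothetical and will need adjusting to whatever your MSP's export actually contains.

# Group the MSP export by day-of-week and hour to see where the existing patch windows fall
# (hypothetical CSV path and column names - adjust to your export format)
$report = Import-Csv -Path "C:\Temp\msp-patch-summary.csv"

$report | ForEach-Object {
    $timestamp = [datetime]$_.PatchInstalledTime
    [PSCustomObject]@{
        VMName = $_.VMName
        Day    = $timestamp.DayOfWeek
        Hour   = $timestamp.Hour
    }
} |
    Group-Object Day, Hour |
    Sort-Object Count -Descending |
    Select-Object @{Name = "Window"; Expression = { $_.Name }}, Count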


🧱 Step 2 – Create Seven New Maintenance Configurations

This is the foundation of Update Manager β€” define your recurring patch windows.

Click to expand: Create Maintenance Configurations (Sample Script)
# Azure Update Manager - Create Weekly Maintenance Configurations
# Pure PowerShell syntax

# Define parameters
$resourceGroupName = "rg-maintenance-uksouth-001"
$location = "uksouth"
$timezone = "GMT Standard Time"
$startDateTime = "2024-06-01 21:00"
$duration = "03:00"  # 3 hours - meets minimum requirement

# Day mapping for config naming (3-letter lowercase)
$dayMap = @{
    "Monday"    = "mon"
    "Tuesday"   = "tue" 
    "Wednesday" = "wed"
    "Thursday"  = "thu"
    "Friday"    = "fri"
    "Saturday"  = "sat"
    "Sunday"    = "sun"
}

# Create maintenance configurations for each day
foreach ($day in $dayMap.Keys) {
    $shortDay = $dayMap[$day]
    $configName = "contoso-maintenance-config-vms-$shortDay"

    Write-Host "Creating: $configName for $day..." -ForegroundColor Yellow

    try {
        $result = New-AzMaintenanceConfiguration `
            -ResourceGroupName $resourceGroupName `
            -Name $configName `
            -MaintenanceScope "InGuestPatch" `
            -Location $location `
            -StartDateTime $startDateTime `
            -Timezone $timezone `
            -Duration $duration `
            -RecurEvery "Week $day" `
            -InstallPatchRebootSetting "IfRequired" `
            -ExtensionProperty @{"InGuestPatchMode" = "User"} `
            -WindowParameterClassificationToInclude @("Critical", "Security") `
            -LinuxParameterClassificationToInclude @("Critical", "Security") `
            -Tag @{
                "Application"  = "Azure Update Manager"
                "Owner"        = "Contoso"
                "PatchWindow"  = $shortDay
            } `
            -ErrorAction Stop

        Write-Host "βœ“ SUCCESS: $configName" -ForegroundColor Green

        # Quick validation
        $createdConfig = Get-AzMaintenanceConfiguration -ResourceGroupName $resourceGroupName -Name $configName
        Write-Host "  Validated: $($createdConfig.RecurEvery) schedule confirmed" -ForegroundColor Gray

    } catch {
        Write-Host "βœ— FAILED: $configName - $($_.Exception.Message)" -ForegroundColor Red
        continue
    }
}

⚠️ Don't forget: the duration uses HH:mm format ("03:00", not "3 hours"), and the start date/time has to fall on the same day the schedule recurs on.

Learn more about New-AzMaintenanceConfiguration


πŸ› οΈ Step 3 – Tweak the Maintenance Configs

Some patch windows felt too tight β€” and, just as importantly, I needed to avoid overlaps with existing backup jobs. Rather than let a large CU fail halfway through or run headlong into an Azure Backup job, I extended the duration on select configs and staggered them across the week:

$config = Get-AzMaintenanceConfiguration -ResourceGroupName "rg-maintenance-uksouth-001" -Name "contoso-maintenance-config-vms-sun"
$config.Duration = "04:00"
Update-AzMaintenanceConfiguration -ResourceGroupName "rg-maintenance-uksouth-001" -Name "contoso-maintenance-config-vms-sun" -Configuration $config

# Verify the change
$updatedConfig = Get-AzMaintenanceConfiguration -ResourceGroupName "rg-maintenance-uksouth-001" -Name "contoso-maintenance-config-vms-sun"
Write-Host "Sunday window now: $($updatedConfig.Duration) duration" -ForegroundColor Green

Learn more about Update-AzMaintenanceConfiguration


πŸ€– Step 4 – Use AI to Group VMs by Patch Activity

Armed with CSV exports of the latest patching summaries, I got AI to do the grunt work and make sense of the contents.

What I did:

  1. Exported MSP data: Weekly CSV reports showing patch installation timestamps for each VM

  2. Used GenAI with various iterative prompts, starting the conversation with this:

    "Attached is an export summary of the current patching activity from our incumbent MSP, who currently look after the patching of the VMs in Azure. I need you to review the timestamps and work out which maintenance window each VM is currently in, and then match that to the appropriate maintenance config that we have just created. If there are mismatches between the current and new schedules then we may need to tweak the settings of the new configs."

  3. AI analysis revealed:
     β€’ 60% of VMs were patching on one weekday evening
     β€’ Several critical systems patching simultaneously
     β€’ No consideration for application dependencies

  4. AI recommendation: Spread VMs across weekdays based on:
     β€’ Criticality: Domain controllers on different days
     β€’ Function: Similar servers on different days (avoid single points of failure)
     β€’ Dependencies: Database servers before application servers

The result: A logical rebalancing that avoided putting all our eggs in the "Sunday 1 AM" basket and considered business impact.

Why this matters: The current patching schedule was not optimized for business continuity. AI helped identify risks we hadn't considered.
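
Before acting on the AI's suggestions, it's worth eyeballing the proposed distribution. A small sketch, assuming the mapping was exported to a CSV with hypothetical VMName and PatchWindow columns:

# Summarise how many VMs land in each proposed window (file path and columns are hypothetical)
$mapping = Import-Csv -Path "C:\Temp\vm-patch-window-mapping.csv"

$mapping | Group-Object PatchWindow | Sort-Object Name | ForEach-Object {
    Write-Host "$($_.Name): $($_.Count) VMs" -ForegroundColor White
    $_.Group | ForEach-Object { Write-Host "  $($_.VMName)" -ForegroundColor Gray }
}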


πŸ” Step 5 – Discover All VMs and Identify Gaps

Before diving into bulk tagging, I needed to understand what we were working with across all subscriptions.

First, let's see what VMs we have:

Click to expand: Discover Untagged VMs (Sample Script)
# Discover Untagged VMs Script for Azure Update Manager
# This script identifies VMs that are missing Azure Update Manager tags

$scriptStart = Get-Date

Write-Host "=== Azure Update Manager - Discover Untagged VMs ===" -ForegroundColor Cyan
Write-Host "Scanning all accessible subscriptions for VMs missing maintenance tags..." -ForegroundColor White
Write-Host ""

# Function to check if VM has Azure Update Manager tags
function Test-VMHasMaintenanceTags {
    param($VM)

    # Check for the three required tags
    $hasOwnerTag = $VM.Tags -and $VM.Tags.ContainsKey("Owner") -and $VM.Tags["Owner"] -eq "Contoso"
    $hasUpdatesTag = $VM.Tags -and $VM.Tags.ContainsKey("Updates") -and $VM.Tags["Updates"] -eq "Azure Update Manager"
    $hasPatchWindowTag = $VM.Tags -and $VM.Tags.ContainsKey("PatchWindow")

    return $hasOwnerTag -and $hasUpdatesTag -and $hasPatchWindowTag
}

# Function to get VM details for reporting
function Get-VMDetails {
    param($VM, $SubscriptionName)

    return [PSCustomObject]@{
        Name = $VM.Name
        ResourceGroup = $VM.ResourceGroupName
        Location = $VM.Location
        Subscription = $SubscriptionName
        SubscriptionId = $VM.SubscriptionId
        PowerState = $VM.PowerState
        OsType = $VM.StorageProfile.OsDisk.OsType
        VmSize = $VM.HardwareProfile.VmSize
        Tags = if ($VM.Tags) { ($VM.Tags.Keys | ForEach-Object { "$_=$($VM.Tags[$_])" }) -join "; " } else { "No tags" }
    }
}

# Initialize collections
$taggedVMs = @()
$untaggedVMs = @()
$allVMs = @()
$subscriptionSummary = @{}

Write-Host "=== DISCOVERING VMs ACROSS ALL SUBSCRIPTIONS ===" -ForegroundColor Cyan

# Get all accessible subscriptions
$subscriptions = Get-AzSubscription | Where-Object { $_.State -eq "Enabled" }
Write-Host "Found $($subscriptions.Count) accessible subscriptions" -ForegroundColor White

foreach ($subscription in $subscriptions) {
    try {
        Write-Host "`nScanning subscription: $($subscription.Name) ($($subscription.Id))" -ForegroundColor Magenta
        $null = Set-AzContext -SubscriptionId $subscription.Id -ErrorAction Stop

        # Get all VMs in this subscription
        Write-Host "  Retrieving VMs..." -ForegroundColor Gray
        $vms = Get-AzVM -Status -ErrorAction Continue

        $subTagged = 0
        $subUntagged = 0
        $subTotal = $vms.Count

        Write-Host "  Found $subTotal VMs in this subscription" -ForegroundColor White

        foreach ($vm in $vms) {
            $vmDetails = Get-VMDetails -VM $vm -SubscriptionName $subscription.Name
            $allVMs += $vmDetails

            if (Test-VMHasMaintenanceTags -VM $vm) {
                $taggedVMs += $vmDetails
                $subTagged++
                Write-Host "    βœ“ Tagged: $($vm.Name)" -ForegroundColor Green
            } else {
                $untaggedVMs += $vmDetails
                $subUntagged++
                Write-Host "    ⚠️ Untagged: $($vm.Name)" -ForegroundColor Yellow
            }
        }

        # Store subscription summary
        $subscriptionSummary[$subscription.Name] = @{
            Total = $subTotal
            Tagged = $subTagged
            Untagged = $subUntagged
            SubscriptionId = $subscription.Id
        }

        Write-Host "  Subscription Summary - Total: $subTotal | Tagged: $subTagged | Untagged: $subUntagged" -ForegroundColor Gray

    }
    catch {
        Write-Host "  βœ— Error scanning subscription $($subscription.Name): $($_.Exception.Message)" -ForegroundColor Red
        $subscriptionSummary[$subscription.Name] = @{
            Total = 0
            Tagged = 0
            Untagged = 0
            Error = $_.Exception.Message
        }
    }
}

Write-Host ""
Write-Host "=== OVERALL DISCOVERY SUMMARY ===" -ForegroundColor Cyan
Write-Host "Total VMs found: $($allVMs.Count)" -ForegroundColor White
Write-Host "VMs with maintenance tags: $($taggedVMs.Count)" -ForegroundColor Green
Write-Host "VMs missing maintenance tags: $($untaggedVMs.Count)" -ForegroundColor Red

if ($untaggedVMs.Count -eq 0) {
    Write-Host "οΏ½ ALL VMs ARE ALREADY TAGGED! οΏ½" -ForegroundColor Green
    Write-Host "No further action required." -ForegroundColor White
    exit 0
}

Write-Host ""
Write-Host "=== SUBSCRIPTION BREAKDOWN ===" -ForegroundColor Cyan
$subscriptionSummary.GetEnumerator() | Sort-Object Name | ForEach-Object {
    $sub = $_.Value
    if ($sub.Error) {
        Write-Host "$($_.Key): ERROR - $($sub.Error)" -ForegroundColor Red
    } else {
        $percentage = if ($sub.Total -gt 0) { [math]::Round(($sub.Tagged / $sub.Total) * 100, 1) } else { 0 }
        Write-Host "$($_.Key): $($sub.Tagged)/$($sub.Total) tagged ($percentage%)" -ForegroundColor White
    }
}

Write-Host ""
Write-Host "=== UNTAGGED VMs DETAILED LIST ===" -ForegroundColor Red
Write-Host "The following $($untaggedVMs.Count) VMs are missing Azure Update Manager maintenance tags:" -ForegroundColor White

# Group untagged VMs by subscription for easier reading
$untaggedBySubscription = $untaggedVMs | Group-Object Subscription

foreach ($group in $untaggedBySubscription | Sort-Object Name) {
    Write-Host "`nοΏ½ Subscription: $($group.Name) ($($group.Count) untagged VMs)" -ForegroundColor Magenta

    $group.Group | Sort-Object Name | ForEach-Object {
        Write-Host "  β€’ $($_.Name)" -ForegroundColor Yellow
        Write-Host "    Resource Group: $($_.ResourceGroup)" -ForegroundColor Gray
        Write-Host "    Location: $($_.Location)" -ForegroundColor Gray
        Write-Host "    OS Type: $($_.OsType)" -ForegroundColor Gray
        Write-Host "    VM Size: $($_.VmSize)" -ForegroundColor Gray
        Write-Host "    Power State: $($_.PowerState)" -ForegroundColor Gray
        if ($_.Tags -ne "No tags") {
            Write-Host "    Existing Tags: $($_.Tags)" -ForegroundColor DarkGray
        }
        Write-Host ""
    }
}

Write-Host "=== ANALYSIS BY VM CHARACTERISTICS ===" -ForegroundColor Cyan

# Analyze by OS Type
$untaggedByOS = $untaggedVMs | Group-Object OsType
Write-Host "`nοΏ½ Untagged VMs by OS Type:" -ForegroundColor White
$untaggedByOS | Sort-Object Name | ForEach-Object {
    Write-Host "  $($_.Name): $($_.Count) VMs" -ForegroundColor White
}

# Analyze by Location
$untaggedByLocation = $untaggedVMs | Group-Object Location
Write-Host "`nοΏ½ Untagged VMs by Location:" -ForegroundColor White
$untaggedByLocation | Sort-Object Count -Descending | ForEach-Object {
    Write-Host "  $($_.Name): $($_.Count) VMs" -ForegroundColor White
}

# Analyze by VM Size (to understand workload types)
$untaggedBySize = $untaggedVMs | Group-Object VmSize
Write-Host "`nοΏ½ Untagged VMs by Size:" -ForegroundColor White
$untaggedBySize | Sort-Object Count -Descending | Select-Object -First 10 | ForEach-Object {
    Write-Host "  $($_.Name): $($_.Count) VMs" -ForegroundColor White
}

# Analyze by Resource Group (might indicate application/workload groupings)
$untaggedByRG = $untaggedVMs | Group-Object ResourceGroup
Write-Host "`nοΏ½ Untagged VMs by Resource Group (Top 10):" -ForegroundColor White
$untaggedByRG | Sort-Object Count -Descending | Select-Object -First 10 | ForEach-Object {
    Write-Host "  $($_.Name): $($_.Count) VMs" -ForegroundColor White
}

Write-Host ""
Write-Host "=== POWER STATE ANALYSIS ===" -ForegroundColor Cyan
$powerStates = $untaggedVMs | Group-Object PowerState
$powerStates | Sort-Object Count -Descending | ForEach-Object {
    Write-Host "$($_.Name): $($_.Count) VMs" -ForegroundColor White
}

Write-Host ""
Write-Host "=== EXPORT OPTIONS ===" -ForegroundColor Cyan
Write-Host "You can export this data for further analysis:" -ForegroundColor White

# Export to CSV option
$timestamp = Get-Date -Format "yyyyMMdd-HHmm"
$csvPath = "D:\UntaggedVMs-$timestamp.csv"

try {
    $untaggedVMs | Export-Csv -Path $csvPath -NoTypeInformation
    Write-Host "βœ“ Exported untagged VMs to: $csvPath" -ForegroundColor Green
} catch {
    Write-Host "βœ— Failed to export CSV: $($_.Exception.Message)" -ForegroundColor Red
}

# Show simple list for easy copying
Write-Host ""
Write-Host "=== SIMPLE VM NAME LIST (for copy/paste) ===" -ForegroundColor Cyan
Write-Host "VM Names:" -ForegroundColor White
$untaggedVMs | Sort-Object Name | ForEach-Object { Write-Host "  $($_.Name)" -ForegroundColor Yellow }

Write-Host ""
Write-Host "=== NEXT STEPS RECOMMENDATIONS ===" -ForegroundColor Cyan
Write-Host "1. Review the untagged VMs list above" -ForegroundColor White
Write-Host "2. Investigate why these VMs were not in the original patching schedule" -ForegroundColor White
Write-Host "3. Determine appropriate maintenance windows for these VMs" -ForegroundColor White
Write-Host "4. Consider grouping by:" -ForegroundColor White
Write-Host "   β€’ Application/workload (Resource Group analysis)" -ForegroundColor Gray
Write-Host "   β€’ Environment (naming patterns, tags)" -ForegroundColor Gray
Write-Host "   β€’ Business criticality" -ForegroundColor Gray
Write-Host "   β€’ Maintenance window preferences" -ForegroundColor Gray
Write-Host "5. Run the tagging script to assign maintenance windows" -ForegroundColor White

Write-Host ""
Write-Host "=== AZURE RESOURCE GRAPH QUERY ===" -ForegroundColor Cyan
Write-Host "Use this query in Azure Resource Graph Explorer to verify results:" -ForegroundColor White
Write-Host ""
Write-Host @"
Resources
| where type == "microsoft.compute/virtualmachines"
| where tags.PatchWindow == "" or isempty(tags.PatchWindow) or isnull(tags.PatchWindow)
| project name, resourceGroup, subscriptionId, location, 
          osType = properties.storageProfile.osDisk.osType,
          vmSize = properties.hardwareProfile.vmSize,
          powerState = properties.extended.instanceView.powerState.displayStatus,
          tags
| sort by name asc
"@ -ForegroundColor Gray

Write-Host ""
Write-Host "Script completed at $(Get-Date)" -ForegroundColor Cyan
Write-Host "Total runtime: $((Get-Date) - $scriptStart)" -ForegroundColor Gray

Discovery results:

  • 35 VMs from the original MSP schedule (our planned list)
  • 12 additional VMs not in the MSP schedule (the "stragglers")
  • Total: 90 VMs needing Update Manager tags

Key insight: The MSP wasn't managing everything. Several dev/test VMs and a few production systems were missing from their schedule.


✍️ Step 6 – Bulk Tag All VMs with Patch Windows

Now for the main event: tagging all VMs with their maintenance windows. This includes both our planned VMs and the newly discovered ones.

🎯 Main VM Tagging (Planned Schedule)

Each tag serves a specific purpose:

  • PatchWindow β€” The key tag used by dynamic scopes to assign VMs to maintenance configurations
  • Owner β€” For accountability and filtering
  • Updates β€” Identifies VMs managed by Azure Update Manager
Click to expand: Multi-Subscription Azure Update Manager VM Tagging (Sample Script)
# Multi-Subscription Azure Update Manager VM Tagging Script
# This script discovers VMs across multiple subscriptions and tags them appropriately

Write-Host "=== Multi-Subscription Azure Update Manager - VM Tagging Script ===" -ForegroundColor Cyan

# Function to safely tag a VM
function Set-VMMaintenanceTags {
    param(
        [string]$VMName,
        [string]$ResourceGroupName,
        [string]$SubscriptionId,
        [hashtable]$Tags,
        [string]$MaintenanceWindow
    )

    try {
        # Set context to the VM's subscription
        $null = Set-AzContext -SubscriptionId $SubscriptionId -ErrorAction Stop

        Write-Host "  Processing: $VMName..." -ForegroundColor Yellow

        # Get the VM and update tags
        $vm = Get-AzVM -ResourceGroupName $ResourceGroupName -Name $VMName -ErrorAction Stop

        if ($vm.Tags) {
            $Tags.Keys | ForEach-Object { $vm.Tags[$_] = $Tags[$_] }
        } else {
            $vm.Tags = $Tags
        }

        $null = Update-AzVM -VM $vm -ResourceGroupName $ResourceGroupName -Tag $vm.Tags -ErrorAction Stop
        Write-Host "  βœ“ Successfully tagged $VMName for $MaintenanceWindow maintenance" -ForegroundColor Green

        return $true
    }
    catch {
        Write-Host "  βœ— Failed to tag $VMName`: $($_.Exception.Message)" -ForegroundColor Red
        return $false
    }
}

# Define all target VMs organized by maintenance window
$maintenanceGroups = @{
    "Monday" = @{
        "VMs" = @("WEB-PROD-01", "DB-PROD-01", "APP-PROD-01", "FILE-PROD-01", "DC-PROD-01")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "mon"
        }
    }
    "Tuesday" = @{
        "VMs" = @("WEB-PROD-02", "DB-PROD-02", "APP-PROD-02", "FILE-PROD-02", "DC-PROD-02")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "tue"
        }
    }
    "Wednesday" = @{
        "VMs" = @("WEB-PROD-03", "DB-PROD-03", "APP-PROD-03", "FILE-PROD-03", "DC-PROD-03")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "wed"
        }
    }
    "Thursday" = @{
        "VMs" = @("WEB-PROD-04", "DB-PROD-04", "APP-PROD-04", "FILE-PROD-04", "PRINT-PROD-01")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "thu"
        }
    }
    "Friday" = @{
        "VMs" = @("WEB-PROD-05", "DB-PROD-05", "APP-PROD-05", "FILE-PROD-05", "MONITOR-PROD-01")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "fri"
        }
    }
    "Saturday" = @{
        "VMs" = @("WEB-DEV-01", "DB-DEV-01", "APP-DEV-01", "TEST-SERVER-01", "SANDBOX-01")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "sat-09"
        }
    }
    "Sunday" = @{
        "VMs" = @("WEB-UAT-01", "DB-UAT-01", "APP-UAT-01", "BACKUP-PROD-01", "MGMT-PROD-01")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "sun"
        }
    }
}

# Function to discover VMs across all subscriptions
function Find-VMsAcrossSubscriptions {
    param([array]$TargetVMNames)

    $subscriptions = Get-AzSubscription | Where-Object { $_.State -eq "Enabled" }
    $vmInventory = @{}

    foreach ($subscription in $subscriptions) {
        try {
            $null = Set-AzContext -SubscriptionId $subscription.Id -ErrorAction Stop
            $vms = Get-AzVM -ErrorAction Continue

            foreach ($vm in $vms) {
                if ($vm.Name -in $TargetVMNames) {
                    $vmInventory[$vm.Name] = @{
                        Name = $vm.Name
                        ResourceGroupName = $vm.ResourceGroupName
                        SubscriptionId = $subscription.Id
                        SubscriptionName = $subscription.Name
                        Location = $vm.Location
                    }
                }
            }
        }
        catch {
            Write-Host "Error scanning subscription $($subscription.Name): $($_.Exception.Message)" -ForegroundColor Red
        }
    }

    return $vmInventory
}

# Get all unique VM names and discover their locations
$allTargetVMs = @()
$maintenanceGroups.Values | ForEach-Object { $allTargetVMs += $_.VMs }
$allTargetVMs = $allTargetVMs | Sort-Object -Unique

Write-Host "Discovering locations for $($allTargetVMs.Count) target VMs..." -ForegroundColor White
$vmInventory = Find-VMsAcrossSubscriptions -TargetVMNames $allTargetVMs

# Process each maintenance window
$totalSuccess = 0
$totalFailed = 0

foreach ($windowName in $maintenanceGroups.Keys) {
    $group = $maintenanceGroups[$windowName]
    Write-Host "`n=== $windowName MAINTENANCE WINDOW ===" -ForegroundColor Magenta

    foreach ($vmName in $group.VMs) {
        if ($vmInventory.ContainsKey($vmName)) {
            $vmInfo = $vmInventory[$vmName]
            $result = Set-VMMaintenanceTags -VMName $vmInfo.Name -ResourceGroupName $vmInfo.ResourceGroupName -SubscriptionId $vmInfo.SubscriptionId -Tags $group.Tags -MaintenanceWindow $windowName
            if ($result) { $totalSuccess++ } else { $totalFailed++ }
        } else {
            Write-Host "  ⚠️ VM not found: $vmName" -ForegroundColor Yellow
            $totalFailed++
        }
    }
}

Write-Host "`n=== TAGGING SUMMARY ===" -ForegroundColor Cyan
Write-Host "Successfully tagged: $totalSuccess VMs" -ForegroundColor Green
Write-Host "Failed to tag: $totalFailed VMs" -ForegroundColor Red

🧹 Handle the Stragglers

For the 12 VMs not in the original MSP schedule, I used intelligent assignment based on their function:

Click to expand: Tagging Script for Remaining Untagged VMs (Sample Script)
# Intelligent VM Tagging Script for Remaining Untagged VMs
# This script analyzes and tags the remaining VMs based on workload patterns and load balancing

$scriptStart = Get-Date

Write-Host "=== Intelligent VM Tagging for Remaining VMs ===" -ForegroundColor Cyan
Write-Host "Analyzing and tagging 26 untagged VMs with optimal maintenance window distribution..." -ForegroundColor White
Write-Host ""

# Function to safely tag a VM across subscriptions
function Set-VMMaintenanceTags {
    param(
        [string]$VMName,
        [string]$ResourceGroupName,
        [string]$SubscriptionId,
        [hashtable]$Tags,
        [string]$MaintenanceWindow
    )

    try {
        # Set context to the VM's subscription
        $currentContext = Get-AzContext
        if ($currentContext.Subscription.Id -ne $SubscriptionId) {
            $null = Set-AzContext -SubscriptionId $SubscriptionId -ErrorAction Stop
        }

        Write-Host "  Processing: $VMName..." -ForegroundColor Yellow

        # Get the VM
        $vm = Get-AzVM -ResourceGroupName $ResourceGroupName -Name $VMName -ErrorAction Stop

        # Add maintenance tags to existing tags (preserve existing tags)
        if ($vm.Tags) {
            $Tags.Keys | ForEach-Object {
                $vm.Tags[$_] = $Tags[$_]
            }
        } else {
            $vm.Tags = $Tags
        }

        # Update the VM tags
        $null = Update-AzVM -VM $vm -ResourceGroupName $ResourceGroupName -Tag $vm.Tags -ErrorAction Stop
        Write-Host "  βœ“ Successfully tagged $VMName for $MaintenanceWindow maintenance" -ForegroundColor Green

        return $true
    }
    catch {
        Write-Host "  βœ— Failed to tag $VMName`: $($_.Exception.Message)" -ForegroundColor Red
        return $false
    }
}

# Define current maintenance window loads (after existing 59 VMs)
$currentLoad = @{
    "Monday" = 7
    "Tuesday" = 7 
    "Wednesday" = 10
    "Thursday" = 6
    "Friday" = 6
    "Saturday" = 17  # Dev/Test at 09:00
    "Sunday" = 6
}

Write-Host "=== CURRENT MAINTENANCE WINDOW LOAD ===" -ForegroundColor Cyan
$currentLoad.GetEnumerator() | Sort-Object Name | ForEach-Object {
    Write-Host "$($_.Key): $($_.Value) VMs" -ForegroundColor White
}

# Initialize counters for new assignments
$newAssignments = @{
    "Monday" = 0
    "Tuesday" = 0
    "Wednesday" = 0
    "Thursday" = 0
    "Friday" = 0
    "Saturday" = 0  # Will use sat-09 for dev/test
    "Sunday" = 0
}

Write-Host ""
Write-Host "=== INTELLIGENT VM GROUPING AND ASSIGNMENT ===" -ForegroundColor Cyan

# Define VM groups with intelligent maintenance window assignments
$vmGroups = @{

    # CRITICAL PRODUCTION SYSTEMS - Spread across different days
    "Critical Infrastructure" = @{
        "VMs" = @(
            @{ Name = "DC-PROD-01"; RG = "rg-infrastructure"; Sub = "Production"; Window = "Sunday"; Reason = "Domain Controller - critical infrastructure" },
            @{ Name = "DC-PROD-02"; RG = "rg-infrastructure"; Sub = "Production"; Window = "Monday"; Reason = "Domain Controller - spread from other DCs" },
            @{ Name = "BACKUP-PROD-01"; RG = "rg-backup"; Sub = "Production"; Window = "Tuesday"; Reason = "Backup Server - spread across week" }
        )
    }

    # PRODUCTION BUSINESS APPLICATIONS - Spread for business continuity
    "Production Applications" = @{
        "VMs" = @(
            @{ Name = "WEB-PROD-01"; RG = "rg-web-production"; Sub = "Production"; Window = "Monday"; Reason = "Web Server - Monday for week start" },
            @{ Name = "DB-PROD-01"; RG = "rg-database-production"; Sub = "Production"; Window = "Tuesday"; Reason = "Database Server - Tuesday" },
            @{ Name = "APP-PROD-01"; RG = "rg-app-production"; Sub = "Production"; Window = "Wednesday"; Reason = "Application Server - mid-week" }
        )
    }

    # DEV/TEST SYSTEMS - Saturday morning maintenance (like existing dev/test)
    "Development Systems" = @{
        "VMs" = @(
            @{ Name = "WEB-DEV-01"; RG = "rg-web-development"; Sub = "Development"; Window = "Saturday"; Reason = "Web Dev - join existing dev/test window" },
            @{ Name = "DB-DEV-01"; RG = "rg-database-development"; Sub = "Development"; Window = "Saturday"; Reason = "Database Dev - join existing dev/test window" },
            @{ Name = "TEST-SERVER-01"; RG = "rg-testing"; Sub = "Development"; Window = "Saturday"; Reason = "Test Server - join existing dev/test window" }
            # ... additional dev/test VMs
        )
    }
}

# Initialize counters
$totalProcessed = 0
$totalSuccess = 0
$totalFailed = 0

# Process each group
foreach ($groupName in $vmGroups.Keys) {
    $group = $vmGroups[$groupName]
    Write-Host "`n=== $groupName ===" -ForegroundColor Magenta
    Write-Host "Processing $($group.VMs.Count) VMs in this group" -ForegroundColor White

    foreach ($vmInfo in $group.VMs) {
        $window = $vmInfo.Window
        $vmName = $vmInfo.Name

        Write-Host "`n�️ $vmName β†’ $window maintenance window" -ForegroundColor Yellow
        Write-Host "   Reason: $($vmInfo.Reason)" -ForegroundColor Gray

        # Determine subscription ID from name (the sample data above uses "Development",
        # so every subscription name referenced in your VM list needs a matching case here)
        $subscriptionId = switch ($vmInfo.Sub) {
            "Production"  { (Get-AzSubscription -SubscriptionName "Production").Id }
            "Development" { (Get-AzSubscription -SubscriptionName "Development").Id }
            "DevTest"     { (Get-AzSubscription -SubscriptionName "DevTest").Id }
            "Identity"    { (Get-AzSubscription -SubscriptionName "Identity").Id }
            "DMZ"         { (Get-AzSubscription -SubscriptionName "DMZ").Id }
            default       { Write-Host "  ⚠️ No subscription mapping for '$($vmInfo.Sub)'" -ForegroundColor Yellow; $null }
        }

        # Create appropriate tags based on maintenance window
        $tags = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
        }

        if ($window -eq "Saturday") {
            $tags["PatchWindow"] = "sat-09"  # Saturday 09:00 for dev/test
        } else {
            $tags["PatchWindow"] = $window.ToLower().Substring(0,3)  # mon, tue, wed, etc.
        }

        $result = Set-VMMaintenanceTags -VMName $vmInfo.Name -ResourceGroupName $vmInfo.RG -SubscriptionId $subscriptionId -Tags $tags -MaintenanceWindow $window

        $totalProcessed++
        if ($result) { 
            $totalSuccess++
            $newAssignments[$window]++
        } else { 
            $totalFailed++ 
        }
    }
}

Write-Host ""
Write-Host "=== TAGGING SUMMARY ===" -ForegroundColor Cyan
Write-Host "Total VMs processed: $totalProcessed" -ForegroundColor White
Write-Host "Successfully tagged: $totalSuccess" -ForegroundColor Green
Write-Host "Failed to tag: $totalFailed" -ForegroundColor Red

Write-Host ""
Write-Host "=== NEW MAINTENANCE WINDOW DISTRIBUTION ===" -ForegroundColor Cyan
Write-Host "VMs added to each maintenance window:" -ForegroundColor White

$newAssignments.GetEnumerator() | Sort-Object Name | ForEach-Object {
    if ($_.Value -gt 0) {
        $newTotal = $currentLoad[$_.Key] + $_.Value
        Write-Host "$($_.Key): +$($_.Value) VMs (total: $newTotal VMs)" -ForegroundColor Green
    }
}

Write-Host ""
Write-Host "=== FINAL MAINTENANCE WINDOW LOAD ===" -ForegroundColor Cyan
$finalLoad = @{}
$currentLoad.Keys | ForEach-Object {
    $finalLoad[$_] = $currentLoad[$_] + $newAssignments[$_]
}

$finalLoad.GetEnumerator() | Sort-Object Name | ForEach-Object {
    $status = if ($_.Value -le 8) { "Green" } elseif ($_.Value -le 12) { "Yellow" } else { "Red" }
    Write-Host "$($_.Key): $($_.Value) VMs" -ForegroundColor $status
}

$grandTotal = ($finalLoad.Values | Measure-Object -Sum).Sum
Write-Host "`nGrand Total: $grandTotal VMs across all maintenance windows" -ForegroundColor White

Write-Host ""
Write-Host "=== BUSINESS LOGIC APPLIED ===" -ForegroundColor Cyan
Write-Host "βœ… Critical systems spread across different days for resilience" -ForegroundColor Green
Write-Host "βœ… Domain Controllers distributed to avoid single points of failure" -ForegroundColor Green
Write-Host "βœ… Dev/Test systems consolidated to Saturday morning (existing pattern)" -ForegroundColor Green
Write-Host "βœ… Production workstations spread to minimize user impact" -ForegroundColor Green
Write-Host "βœ… Business applications distributed for operational continuity" -ForegroundColor Green
Write-Host "βœ… Load balancing maintained across the week" -ForegroundColor Green

Write-Host ""
Write-Host "=== VERIFICATION STEPS ===" -ForegroundColor Cyan
Write-Host "1. Verify tags in Azure Portal across all subscriptions" -ForegroundColor White
Write-Host "2. Check that critical systems are on different days" -ForegroundColor White
Write-Host "3. Confirm dev/test systems are in Saturday morning window" -ForegroundColor White
Write-Host "4. Review production systems distribution" -ForegroundColor White

Write-Host ""
Write-Host "=== AZURE RESOURCE GRAPH VERIFICATION QUERY ===" -ForegroundColor Cyan
Write-Host "Use this query to verify all VMs are now tagged:" -ForegroundColor White
Write-Host ""
Write-Host @"
Resources
| where type == "microsoft.compute/virtualmachines"
| where tags.Updates == "Azure Update Manager"
| project name, resourceGroup, subscriptionId, 
          patchWindow = tags.PatchWindow,
          owner = tags.Owner,
          updates = tags.Updates
| sort by patchWindow, name
| summarize count() by patchWindow
"@ -ForegroundColor Gray

if ($totalFailed -eq 0) {
    Write-Host ""
    Write-Host "οΏ½ ALL VMs SUCCESSFULLY TAGGED WITH INTELLIGENT DISTRIBUTION! οΏ½" -ForegroundColor Green
} else {
    Write-Host ""
    Write-Host "⚠️ Some VMs failed to tag. Please review errors above." -ForegroundColor Yellow
}

Write-Host ""
Write-Host "Script completed at $(Get-Date)" -ForegroundColor Cyan
Write-Host "Total runtime: $((Get-Date) - $scriptStart)" -ForegroundColor Gray

Key insight: I grouped VMs by function and criticality, not just by convenience. Domain controllers got spread across different days, dev/test systems joined the existing Saturday morning window, and production applications were distributed for business continuity.


🧰 Step 7 – Configure Azure Policy Prerequisites

Here's where things get interesting. Update Manager is built on compliance β€” but your VMs won't show up in dynamic scopes unless they meet certain prerequisites. Enter Azure Policy to save the day.

You'll need two specific built-in policies assigned at the subscription (or management group) level:

βœ… Policy 1: Set prerequisites for scheduling recurring updates on Azure virtual machines

What it does: This policy ensures your VMs have the necessary configurations to participate in Azure Update Manager. It automatically:

  • Installs the Azure Update Manager extension on Windows VMs
  • Registers required resource providers
  • Configures the VM to report its update compliance status
  • Sets the patch orchestration mode appropriately

Why this matters: Without this policy, VMs won't appear in Update Manager scopes even if they're tagged correctly. The policy handles all the "plumbing" automatically.

Assignment scope: Apply this at subscription or management group level to catch all VMs.

βœ… Policy 2: Configure periodic checking for missing system updates on Azure virtual machines

What it does: This is your compliance engine. It configures VMs to:

  • Regularly scan for available updates (but not install them automatically)
  • Report update status back to Azure Update Manager
  • Enable the compliance dashboard views in the portal
  • Provide the data needed for maintenance configuration targeting

Why this matters: This policy turns on the "update awareness" for your VMs. Without it, Azure Update Manager has no visibility into what patches are needed.

Assignment scope: Same as above β€” subscription or management group level.

🎯 Assigning the Policies

Step-by-step in Azure Portal:

  1. Navigate to Azure Policy
     β€’ Azure Portal β†’ Search "Policy" β†’ Select "Policy"

  2. Find the First Policy
     β€’ Left menu: Definitions
     β€’ Search: Set prerequisites for scheduling recurring updates
     β€’ Click on the policy title

  3. Assign the Policy
     β€’ Click the Assign button
     β€’ Scope: Select your subscription(s)
     β€’ Basics: Leave policy name as default
     β€’ Parameters: Leave as default
     β€’ Remediation: βœ… Check "Create remediation task"
     β€’ Review + create

  4. Repeat for Second Policy
     β€’ Search: Configure periodic checking for missing system updates
     β€’ Follow the same assignment process

⚠️ Important: Policy compliance can take 30+ minutes to evaluate and apply. Perfect time for a brew.
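
If you'd rather script the assignments than click through the portal, something along these lines should work. Treat it as a sketch: the display names are the ones used above, the scope is a placeholder, and parameter names can differ slightly between Az.Resources versions.

# Assign both built-in policies at subscription scope with a system-assigned identity for remediation
$scope    = "/subscriptions/<subscription-id>"   # placeholder - substitute your subscription ID
$location = "uksouth"

$policyNames = @(
    "Set prerequisites for scheduling recurring updates on Azure virtual machines",
    "Configure periodic checking for missing system updates on Azure virtual machines"
)

foreach ($displayName in $policyNames) {
    $definition = Get-AzPolicyDefinition -Builtin |
        Where-Object { $_.Properties.DisplayName -eq $displayName }

    $assignment = New-AzPolicyAssignment -Name $definition.Name `
        -DisplayName $displayName `
        -Scope $scope `
        -PolicyDefinition $definition `
        -Location $location `
        -IdentityType SystemAssigned

    # NOTE: the assignment's managed identity may also need role assignments before remediation can modify VMs
    Start-AzPolicyRemediation -Name "remediate-$($definition.Name)" `
        -PolicyAssignmentId $assignment.PolicyAssignmentId
}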

πŸ” Monitoring Compliance

Once assigned, you can track compliance in Azure Policy > Compliance. Look for:

  • Non-compliant VMs that need the extension installed
  • VMs that aren't reporting update status properly
  • Any policy assignment errors that need investigation

Learn more about Azure Policy for Update Management


πŸ§ͺ Step 8 – Create Dynamic Scopes in Update Manager

This is where it all comes together β€” and where the magic happens.

Dynamic scopes use those PatchWindow tags to assign VMs to the correct patch config automatically. No more manual VM assignment, no more "did we remember to add the new server?" conversations.

🎯 The Portal Dance

Unfortunately, as of writing, dynamic scopes can only be configured through the Azure portal β€” no PowerShell or ARM template support yet.

Why portal only? Dynamic scopes are still in preview, and Microsoft hasn't released the PowerShell cmdlets or ARM template schemas yet. This means you can't fully automate the deployment, but the functionality itself works perfectly.

Here's the step-by-step:

  1. Navigate to Azure Update Manager
     β€’ Portal β†’ All Services β†’ Azure Update Manager

  2. Access Maintenance Configurations
     β€’ Go to Maintenance Configurations (Preview)
     β€’ Select one of your configs (e.g. contoso-maintenance-config-vms-mon)

  3. Create Dynamic Scope
     β€’ Click Dynamic Scopes β†’ Add
     β€’ Name: DynamicScope-Monday-VMs
     β€’ Description: Auto-assign Windows VMs tagged for Monday maintenance

  4. Configure Scope Settings
     β€’ Subscription: Select your subscription(s)
     β€’ Resource Type: Microsoft.Compute/virtualMachines
     β€’ OS Type: Windows (create separate scopes for Linux if needed)

  5. Set Tag Filters
     β€’ Tag Name: PatchWindow
     β€’ Tag Value: mon (must match your maintenance config naming)
     β€’ Additional filters (optional): Owner = Contoso, Updates = Azure Update Manager

  6. Review and Create
     β€’ Verify the filter logic
     β€’ Click Create

πŸ”„ Repeat for All Days

You'll need to create dynamic scopes for each maintenance configuration:

Maintenance Config                 | Dynamic Scope Name         | Tag Filter
contoso-maintenance-config-vms-mon | DynamicScope-Monday-VMs    | PatchWindow = mon
contoso-maintenance-config-vms-tue | DynamicScope-Tuesday-VMs   | PatchWindow = tue
contoso-maintenance-config-vms-wed | DynamicScope-Wednesday-VMs | PatchWindow = wed
contoso-maintenance-config-vms-thu | DynamicScope-Thursday-VMs  | PatchWindow = thu
contoso-maintenance-config-vms-fri | DynamicScope-Friday-VMs    | PatchWindow = fri
contoso-maintenance-config-vms-sat | DynamicScope-Saturday-VMs  | PatchWindow = sat-09
contoso-maintenance-config-vms-sun | DynamicScope-Sunday-VMs    | PatchWindow = sun

πŸ” Verify Dynamic Scope Assignment

Once created, you can verify the scopes are working:

  1. In the Maintenance Configuration:
     β€’ Go to Dynamic Scopes
     β€’ Check the Resources tab to see matched VMs
     β€’ Verify the expected VM count matches your tagging
     β€’ Wait time: allow 15-30 minutes for newly tagged VMs to appear

  2. What success looks like:
     β€’ Monday scope shows 5 VMs (WEB-PROD-01, DB-PROD-01, etc.)
     β€’ Saturday scope shows 5 VMs (WEB-DEV-01, DB-DEV-01, etc.)
     β€’ No VMs showing? Check tag case sensitivity and filters

  3. In Azure Resource Graph:

MaintenanceResources
| where type == "microsoft.maintenance/configurationassignments"
| extend vmName = tostring(split(resourceId, "/")[8])
| extend configName = tostring(properties.maintenanceConfigurationId)
| project vmName, configName, resourceGroup
| order by configName, vmName

  4. Troubleshoot empty scopes:
     β€’ Verify subscription selection includes all your VMs
     β€’ Check tag spelling: PatchWindow (case sensitive)
     β€’ Confirm the resource type filter: Microsoft.Compute/virtualMachines
     β€’ Wait longer; it can take up to 30 minutes
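
The same Resource Graph check can be run from PowerShell via the Az.ResourceGraph module, which saves a trip to the portal; a sketch:

# Count configuration assignments per maintenance configuration across accessible subscriptions
# (requires the Az.ResourceGraph module: Install-Module Az.ResourceGraph -Scope CurrentUser)
$query = @"
MaintenanceResources
| where type == "microsoft.maintenance/configurationassignments"
| extend configName = tostring(properties.maintenanceConfigurationId)
| summarize VMCount = count() by configName
| order by configName asc
"@

Search-AzGraph -Query $query | Format-Table configName, VMCount -AutoSize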

⚠️ Common Gotchas

Tag Case Sensitivity: Dynamic scopes are case-sensitive. mon β‰  Mon β‰  MON

Subscription Scope: Ensure you've selected all relevant subscriptions in the scope configuration.

Resource Type Filter: Don't forget to set the resource type filter β€” without it, you'll match storage accounts, networking, etc.

Timing: It can take 15-30 minutes for newly tagged VMs to appear in dynamic scopes.

Dynamic scope configuration docs


πŸš€ Step 9 – Test & Verify (The Moment of Truth)

The acid test: does it actually patch stuff properly?

πŸŽͺ Proof of Concept Test

I started conservatively β€” scoped contoso-maintenance-config-vms-sun to a few non-critical VMs and let it run overnight on Sunday.

Monday morning verification:

  • βœ”οΈ Patch compliance dashboard: All green ticks
  • βœ”οΈ Reboot timing: Machines restarted within their 4-hour window (21:00-01:00)
  • βœ”οΈ Update logs: Activity logs showed expected patching behavior
  • βœ”οΈ Business impact: Zero helpdesk tickets on Monday morning

πŸ“Š Full Rollout Verification

Once confident with the Sunday test, I enabled all remaining dynamic scopes and monitored the week:

Key metrics tracked:

  • Patch compliance percentage across all VMs
  • Failed patch installations (and root causes)
  • Reboot timing adherence
  • Business hours impact (spoiler: zero)

πŸ” Monitoring & Validation Tools

Azure Update Manager Dashboard:

Azure Portal β†’ Update Manager β†’ Overview
- Patch compliance summary
- Recent patch installations
- Failed installations with details

Azure Resource Graph Queries:

// Verify all VMs have maintenance tags
Resources
| where type == "microsoft.compute/virtualmachines"
| where tags.Updates == "Azure Update Manager"
| project name, resourceGroup, subscriptionId, 
          patchWindow = tags.PatchWindow,
          owner = tags.Owner
| summarize count() by patchWindow
| order by patchWindow

// Check maintenance configuration assignments
MaintenanceResources
| where type == "microsoft.maintenance/configurationassignments"
| extend vmName = tostring(split(resourceId, "/")[8])
| extend configName = tostring(properties.maintenanceConfigurationId)
| project vmName, configName, subscriptionId
| summarize VMCount = count() by configName
| order by configName

PowerShell Verification:

# Quick check of maintenance configuration status
Get-AzMaintenanceConfiguration -ResourceGroupName "rg-maintenance-uksouth-001" | 
    Select-Object Name, MaintenanceScope, RecurEvery | 
    Format-Table -AutoSize

# Verify VM tag distribution
$subscriptions = Get-AzSubscription | Where-Object { $_.State -eq "Enabled" }
$tagSummary = @{}

foreach ($sub in $subscriptions) {
    Set-AzContext -SubscriptionId $sub.Id | Out-Null
    $vms = Get-AzVM | Where-Object { $_.Tags.PatchWindow }

    foreach ($vm in $vms) {
        $window = $vm.Tags.PatchWindow
        if (-not $tagSummary.ContainsKey($window)) {
            $tagSummary[$window] = 0
        }
        $tagSummary[$window]++
    }
}

Write-Host "=== VM DISTRIBUTION BY PATCH WINDOW ===" -ForegroundColor Cyan
$tagSummary.GetEnumerator() | Sort-Object Name | ForEach-Object {
    Write-Host "$($_.Key): $($_.Value) VMs" -ForegroundColor White
}

πŸ“ˆ Success Metrics

After two full weeks of operation:

  • Better control: Direct management of patch schedules and policies
  • Increased visibility: Real-time compliance dashboards vs. periodic reports
  • Reduced complexity: Native Azure tooling vs. third-party solutions

Monitor updates in Azure Update Manager


πŸ“ƒ Final Thoughts & Tips

βœ… Cost-neutral β€” No more third-party patch agents
βœ… Policy-driven β€” Enforced consistency with Azure Policy
βœ… Easily auditable β€” Tag-based scoping is clean and visible
βœ… Scalable β€” New VMs auto-join patch schedules via tagging

⚠️ Troubleshooting Guide & Common Issues

Here's what I learned the hard way, so you don't have to:

Symptom                             | Possible Cause               | Fix
VM not showing in dynamic scope     | Tag typo or case mismatch    | Verify the PatchWindow tag value exactly matches the scope's tag filter
Maintenance config creation fails   | Invalid duration format      | Use HH:mm format: "03:00", not "3 hours"
VM skipped during patching          | Policy prerequisites not met | Check the Azure Policy compliance dashboard
No updates applied despite schedule | VM has a pending reboot      | Clear pending reboots and check the update history
Dynamic scope shows zero VMs        | Wrong subscription scope     | Verify the subscription selection in the scope config
Extension installation failed       | Insufficient permissions     | Ensure VM Contributor rights and resource provider registration
Policy compliance stuck at 0%       | Assignment scope too narrow  | Check the policy is assigned at subscription level
VMs appear/disappear from scope     | Tag inconsistency            | Run the tag verification script across all subscriptions

πŸ”§ Advanced Troubleshooting Commands

Check VM Update Readiness:

# Verify VM has required extensions and configuration
$vmName = "your-vm-name"
$rgName = "your-resource-group"

$vm = Get-AzVM -Name $vmName -ResourceGroupName $rgName -Status
$vm.Extensions | Where-Object { $_.Name -like "*Update*" -or $_.Name -like "*Maintenance*" }

Validate Maintenance Configuration:

# Test maintenance configuration is properly formed
$config = Get-AzMaintenanceConfiguration -ResourceGroupName "rg-maintenance-uksouth-001" -Name "contoso-maintenance-config-vms-mon"
Write-Host "Config Name: $($config.Name)"
Write-Host "Recurrence: $($config.RecurEvery)"
Write-Host "Duration: $($config.Duration)"
Write-Host "Start Time: $($config.StartDateTime)"
Write-Host "Timezone: $($config.TimeZone)"

Policy Compliance Deep Dive:

# Check specific VMs for policy compliance
$policyName = "Set prerequisites for scheduling recurring updates on Azure virtual machines"
$assignments = Get-AzPolicyAssignment | Where-Object { $_.Properties.DisplayName -eq $policyName }
foreach ($assignment in $assignments) {
    Get-AzPolicyState -PolicyAssignmentId $assignment.PolicyAssignmentId | 
        Where-Object { $_.ComplianceState -eq "NonCompliant" } |
        Select-Object ResourceId, ComplianceState, @{Name="Reason";Expression={$_.PolicyEvaluationDetails.EvaluatedExpressions.ExpressionValue}}
}

As always, comments and suggestions welcome over on GitHub or LinkedIn. If you've migrated patching in a different way, I'd love to hear how you approached it.


βš™οΈ Azure BCDR Review – Turning Inherited Cloud Infrastructure into a Resilient Recovery Strategy

When we inherited our Azure estate from a previous MSP, some of the key technical components were already in place β€” ASR was configured for a number of workloads, and backups had been partially implemented across the environment.

What we didn’t inherit was a documented or validated BCDR strategy.

There were no formal recovery plans defined in ASR, no clear failover sequences, and no evidence that a regional outage scenario had ever been modelled or tested. The building blocks were there β€” but there was no framework tying them together into a usable or supportable recovery posture.

This post shares how I approached the challenge of assessing and strengthening our Azure BCDR readiness. It's not about starting from scratch β€” it's about applying structure, logic, and realism to an environment that had the right intentions but lacked operational clarity.

Whether you're stepping into a similar setup or planning your first formal DR review, I hope this provides a practical and relatable blueprint.


🎯 Where We Started: Technical Foundations, Operational Gaps

We weren’t starting from zero β€” but we weren’t in a position to confidently recover the environment either.

What we found:

  • 🟒 ASR replication was partially implemented
  • 🟑 VM backups were present but inconsistent
  • ❌ No Recovery Plans existed in ASR
  • ❌ No test failovers had ever been performed
  • ⚠️ No documented RTO/RPO targets
  • ❓ DNS and Private Endpoints weren’t accounted for in DR
  • πŸ”’ Networking had not been reviewed for failover scenarios
  • 🚫 No capacity reservations had been made

This review was the first step in understanding whether our DR setup could work in practice β€” not just in theory.


πŸ›‘οΈ 1️⃣ Workload Protection: What’s Covered, What’s Not

Some workloads were actively replicated via ASR. Others were only backed up. Some had both, a few had neither. There was no documented logic to explain why.

Workload protection appeared to be driven by convenience or historical context β€” not by business impact or recovery priority.

What we needed was a structured tiering model:

  • 🧠 Which workloads are mission-critical?
  • ⏱️ Which ones can tolerate extended recovery times?
  • πŸ“Š What RTOs and RPOs are actually achievable?

This became the foundation for everything else.


🧩 2️⃣ The Missing Operational Layer

We had technical coverage β€” but no operational recovery strategy.

There were no Recovery Plans in ASR. No sequencing, no post-failover validation, and no scripts or automation.

In the absence of structure, recovery would be entirely manual β€” relying on individual knowledge, assumptions, and good luck during a critical event.

Codifying dependencies, failover order, and recovery steps became a priority.


🌐 3️⃣ DNS, Identity and Private Endpoint Blind Spots

DNS and authentication are easy to overlook β€” until they break.

Our name resolution relied on internal DNS via AD-integrated zones, with no failover logic for internal record switching. No private DNS zones were in place.

Private Endpoints were widely used, but all existed in the primary region. In a DR scenario, they would become unreachable.

Identity was regionally redundant, but untested and not AZ-aware.

We needed to promote DNS, identity, and PE routing to first-class DR concerns.


πŸ’Ύ 4️⃣ Storage and Data Access Risk

Azure Storage backed a range of services β€” from SFTP and app data to file shares and diagnostics.

Replication strategies varied (LRS, RA-GRS, ZRS) with no consistent logic or documentation. Critical storage accounts weren’t aligned with workload tiering.

Some workloads used Azure Files and Azure File Sync, but without defined mount procedures or recovery checks.

In short: compute could come back, but data availability wasn’t assured.


πŸ”Œ 5️⃣ The Networking Piece (And Why It Matters More Than People Think)

NSGs, UDRs, custom routing, and SD-WAN all played a part in how traffic flowed.

But in DR, assumptions break quickly.

There was no documentation of network flow in the DR region, and no validation of inter-VM or service-to-service reachability post-failover.

Some services β€” like App Gateways, Internal Load Balancers, and Private Endpoints β€” were region-bound and would require re-deployment or manual intervention.

Networking wasn’t the background layer β€” it was core to recoverability.


πŸ“¦ 6️⃣ Capacity Risk: When DR Isn’t Guaranteed

VM replication is only half the story. The other half is whether those VMs can actually start during a DR event.

Azure doesn’t guarantee regional capacity unless you've pre-purchased it.

In our case, no capacity reservations had been made. That meant no assurance that our Tier 0 or Tier 1 workloads could even boot if demand spiked during a region-wide outage.

This is a quiet but critical risk β€” and one worth addressing early.
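
Addressing it doesn't have to be a big piece of work either. On-demand capacity reservations can be created with a couple of Az.Compute cmdlets; a minimal sketch with illustrative names, SKU and quantity:

# Reserve compute capacity in the DR region for a critical VM SKU
# (illustrative values - the resource group is assumed to already exist)
$rg       = "rg-drcapacity-ukwest-001"
$location = "ukwest"

New-AzCapacityReservationGroup -ResourceGroupName $rg -Name "crg-tier1-ukwest" -Location $location

New-AzCapacityReservation -ResourceGroupName $rg `
    -ReservationGroupName "crg-tier1-ukwest" `
    -Name "cr-d4sv5-tier1" `
    -Location $location `
    -Sku "Standard_D4s_v5" `
    -CapacityToReserve 4

The reservation only helps if failed-over VMs are actually associated with the reservation group, so it's worth wiring that into the replicated VMs' compute settings at the same time.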


βœ… Conclusion: From Discovery to Direction

This review wasn’t about proving whether DR was in place β€” it was about understanding whether it would actually work.

The tooling was present. The protection was partial. The process was missing.

By mapping out what was covered, where the gaps were, and how recovery would actually unfold, we created a baseline that gave us clarity and control.


πŸ“˜ Coming Next: Documenting the Plan

In the next post, I’ll walk through how I formalised the review into a structured BCDR posture document β€” including:

  • 🧱 Mapping workloads by tier and impact
  • ⏳ Defining current vs target RTO/RPO
  • πŸ› οΈ Highlighting gaps in automation, DNS, storage, and capacity
  • 🧭 Building a recovery plan roadmap
  • βš–οΈ Framing cost vs risk for stakeholder alignment

If you're facing a similar situation β€” whether you're inheriting someone else's cloud estate or building DR into a growing environment β€” I hope this series helps bring structure to the complexity.

Let me know if you'd find it useful to share templates or walkthroughs in the next post.



🧾 Azure BCDR – How I Turned a DR Review into a Strategic Recovery Plan

In Part 1 of this series, I shared how we reviewed our Azure BCDR posture after inheriting a partially implemented cloud estate. The findings were clear: while the right tools were in place, the operational side of disaster recovery hadn’t been addressed.

There were no test failovers, no documented Recovery Plans, no automation, and several blind spots in DNS, storage, and private access.

This post outlines how I took that review and turned it into a practical recovery strategy β€” one that we could share internally, align with our CTO, and use as a foundation for further work with our support partner.

To provide context, our estate is deployed primarily in the UK South Azure region, with UK West serving as the designated DR target region.

It’s not a template β€” it’s a repeatable, real-world approach to structuring a BCDR plan when you’re starting from inherited infrastructure, not a clean slate.


🧭 1. Why Documenting the Plan Matters

Most cloud teams can identify issues. Fewer take the time to formalise the findings in a way that supports action and alignment.

Documenting our BCDR posture gave us three things:

  • 🧠 Clarity β€” a shared understanding of what’s protected and what isn’t
  • πŸ”¦ Visibility β€” a way to surface risk and prioritise fixes
  • 🎯 Direction β€” a set of realistic, cost-aware next steps

We weren’t trying to solve every problem at once. The goal was to define a usable plan we could act on, iterate, and eventually test β€” all while making sure that effort was focused on the right areas.


🧱 2. Starting the Document

I structured the document to speak to both technical stakeholders and senior leadership. It needed to balance operational context with strategic risk.

✍️ Core sections included

  • Executive Summary – what the document is, why it matters
  • Maturity Snapshot – a simple traffic-light view of current vs target posture
  • Workload Overview – what’s in scope and what’s protected
  • Recovery Objectives – realistic RPO/RTO targets by tier
  • Gaps and Risks – the areas most likely to cause DR failure
  • Recommendations – prioritised, actionable, and cost-aware
  • Next Steps – what we can handle internally, and what goes to the MSP

Each section followed the same principle: clear, honest, and focused on action. No fluff, no overstatements β€” just a straightforward review of where we stood and what needed doing.


🧩 3. Defining the Current State

Before we could plan improvements, we had to document what actually existed. This wasn’t about assumptions β€” it was about capturing the real configuration and coverage in Azure.

πŸ—‚οΈ Workload Inventory

We started by categorising all VMs and services:

  • Domain controllers
  • Application servers (web/API/backend)
  • SQL Managed Instances
  • Infrastructure services (file, render, schedulers)
  • Management and monitoring VMs

Each workload was mapped by criticality and recovery priority β€” not just by type.

πŸ›‘οΈ Protection Levels

For each workload, we recorded:

  • βœ… Whether it was protected by ASR
  • πŸ” Whether it was backed up only
  • 🚫 Whether it had no protection (with justification)

We also reviewed the geographic layout β€” e.g. which services were replicated into UK West, and which existed only in UK South.

🧠 Supporting Services

Beyond the VMs, we looked at:

  • Identity services (AD, domain controllers, replication health)
  • DNS architecture (AD-integrated zones, private DNS zones)
  • Private Endpoints and their region-specific availability
  • Storage account replication types (LRS, RA-GRS, ZRS)
  • Network security and routing configurations in DR

The aim wasn’t to build a full asset inventory β€” just to gather enough visibility to start making risk-based decisions about what mattered, and what was missing.


⏱️ 4. Setting Recovery Objectives

Once the current state was mapped, the next step was to define what β€œrecovery” should actually look like β€” in terms that could be communicated, challenged, and agreed.

We focused on two key metrics:

  • RTO (Recovery Time Objective): How long can this system be offline before we see significant operational impact?
  • RPO (Recovery Point Objective): How much data loss is acceptable in a worst-case failover?

These weren’t guessed or copied from a template. We worked with realistic assumptions based on our tooling, team capability, and criticality of the services.

πŸ“Š Tiered Recovery Model

Each workload was assigned to one of four tiers:

| Tier   | Description                                  |
| ------ | -------------------------------------------- |
| Tier 0 | Core infrastructure (identity, DNS, routing) |
| Tier 1 | Mission-critical production workloads        |
| Tier 2 | Important, but not time-sensitive services   |
| SQL MI | Treated separately due to its PaaS nature    |

We then applied RTO and RPO targets based on what we could achieve today vs what we aim to reach with improvements.

πŸ”₯ Heatmap Example

| Workload Tier     | RPO (Current) | RTO (Current) | RPO (Optimised) | RTO (Optimised) |
| ----------------- | ------------- | ------------- | --------------- | --------------- |
| Tier 0 – Identity | 5 min         | 60 min        | 5 min           | 30 min          |
| Tier 1 – Prod     | 5 min         | 360 min       | 5 min           | 60 min          |
| Tier 2 – Non-Crit | 1440 min      | 1440 min      | 60 min          | 240 min         |
| SQL MI            | 0 min         | 60 min        | 0 min           | 30 min          |

🚧 5. Highlighting Gaps and Risks

With recovery objectives defined, the gaps became much easier to identify β€” and to prioritise.

We weren’t trying to protect everything equally. The goal was to focus attention on the areas that introduced the highest risk to recovery if left unresolved.

⚠️ What We Flagged

  • ❌ No test failovers had ever been performed
  • ❌ No Recovery Plans existed
  • 🌐 Public-facing infrastructure only existed in one region
  • πŸ”’ Private Endpoints lacked DR equivalents
  • 🧭 DNS failover was manual or undefined
  • πŸ’Ύ Storage accounts had inconsistent replication logic
  • 🚫 No capacity reservations existed for critical VM SKUs

Each gap was documented with its impact, priority, and remediation options.


πŸ› οΈ 6. Strategic Recommendations

We split our recommendations into what we could handle internally, and what would require input from our MSP or further investment.

πŸ“Œ Internal Actions

  • Build and test Recovery Plans for Tier 0 and Tier 1 workloads
  • Improve DNS failover scripting
  • Review VM tags to reflect criticality and protection state (see the tagging sketch after this list)
  • Create sequencing logic for application groups
  • Align NSGs and UDRs in DR with production
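
For the tag review, a minimal sketch of how a VM could be stamped with criticality and protection tags; the tag names and values here are illustrative rather than an agreed standard:

# Minimal sketch: merge illustrative criticality/protection tags onto a VM.
$vm = Get-AzVM -ResourceGroupName "MyRG" -Name "myVM"

Update-AzTag -ResourceId $vm.Id `
    -Tag @{ Criticality = "Tier1"; DRProtection = "ASR" } `
    -Operation Merge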

🀝 MSP-Led or Partner Support

  • Duplicate App Gateways / ILBs in UK West
  • Implement Private DNS Zones
  • Review and implement capacity reservations
  • Test runbook-driven recovery automation
  • Conduct structured test failovers across service groups

πŸ“… 7. Making It Actionable

A plan needs ownership and timelines. We assigned tasks by role and defined short-, medium-, and long-term priorities using a simple planning table.

We treat the BCDR document as a living artefact β€” updated quarterly, tied to change control, and used to guide internal work and partner collaboration.


πŸ”š 8. Closing Reflections

The original goal wasn’t to build a perfect DR solution β€” it was to understand where we stood, make recovery realistic, and document a plan that would hold up when we needed it most.

We inherited a functional technical foundation β€” but needed to formalise and validate it as part of a resilient DR posture.

By documenting the estate, defining recovery objectives, and identifying where the real risks were, we turned a passive DR posture into something we could act on. We gave stakeholders clarity. We gave the support partner direction. And we gave ourselves a roadmap.


πŸ”œ What’s Next

In the next part of this series, I’ll walk through how we executed the plan:

  • Building and testing our first Recovery Plan
  • Improving ASR coverage and validation
  • Running our first failover drill
  • Reviewing results and updating the heatmap

If you're stepping into an inherited cloud environment or starting your first structured DR review, I hope this gives you a practical view of what’s involved β€” and what’s achievable without overcomplicating the process.

Let me know if you'd like to see templates or report structures from this process in a future post.



πŸ’° Saving Azure Costs with Scheduled VM Start/Stop using Custom Azure Automation Runbooks

As part of my ongoing commitment to FinOps practices, I've implemented several strategies to embed cost-efficiency into the way we manage cloud infrastructure. One proven tactic is scheduling virtual machines to shut down during idle periods, avoiding unnecessary spend.

In this post, I’ll share how I’ve built out custom Azure Automation jobs to schedule VM start and stop operations. Rather than relying on Microsoft’s pre-packaged solution, I’ve developed a streamlined, purpose-built PowerShell implementation that provides maximum flexibility, transparency, and control.


✍️ Why I Chose Custom Runbooks Over the Prebuilt Solution

Microsoft provides a ready-made β€œStart/Stop VMs during off-hours” solution via the Automation gallery. While functional, it’s:

  • A bit over-engineered for simple needs,
  • Relatively opaque under the hood, and
  • Not ideal for environments where control and transparency are priorities.

My custom jobs:

  • Use native PowerShell modules within Azure Automation,
  • Are scoped to exactly the VMs I want via tags,
  • Provide clean logging and alerting, and
  • Keep things simple, predictable, and auditable.

πŸ› οΈ Step 1: Set Up the Azure Automation Account

πŸ”— Official docs: Create and manage an Azure Automation Account

  1. Go to the Azure Portal and search for Automation Accounts.
  2. Click + Create.
  3. Fill out the basics:
     • Name: e.g. vm-scheduler
     • Resource Group: Create new or select existing
     • Region: Preferably where your VMs are located
  4. Enable System-Assigned Managed Identity.
  5. Once created, go to the Automation Account and ensure the following modules are imported using the Modules blade in the Azure Portal:
     • Az.Accounts
     • Az.Compute

βœ… Tip: These modules can be added from the gallery in just a few clicks via the UIβ€”no scripting required.

πŸ’‘ Prefer scripting? You can also install them using PowerShell:

Install-Module -Name Az.Accounts -Force
Install-Module -Name Az.Compute -Force
  6. Assign the Virtual Machine Contributor role to the Automation Account's managed identity at the resource group or subscription level.
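
If you'd rather script that last step too, here's a minimal sketch using New-AzRoleAssignment; the scope is a placeholder, and the object (principal) ID can be copied from the Automation Account's Identity blade:

# Minimal sketch: grant the Automation Account's system-assigned identity
# Virtual Machine Contributor at resource-group scope.
New-AzRoleAssignment `
    -ObjectId "<managed-identity-object-id>" `
    -RoleDefinitionName "Virtual Machine Contributor" `
    -Scope "/subscriptions/<sub-id>/resourceGroups/MyResourceGroup"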

βš™οΈ CLI or PowerShell alternatives

# Azure CLI example to create the automation account
az automation account create \
  --name vm-scheduler \
  --resource-group MyResourceGroup \
  --location uksouth \
  --assign-identity

πŸ“… Step 2: Add VM Tags for Scheduling

Apply consistent tags to any VM you want the runbooks to manage.

| Key           | Value     |
| ------------- | --------- |
| AutoStartStop | devserver |

You can use the Azure Portal or PowerShell to apply these tags.

βš™οΈ Tag VMs via PowerShell

$vm = Get-AzVM -ResourceGroupName "MyRG" -Name "myVM"
$vm.Tags["AutoStartStop"] = "devserver"
Update-AzVM -VM $vm -ResourceGroupName "MyRG"
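
And if you want to tag a whole resource group of dev VMs in one go, a minimal sketch (the resource group name is a placeholder):

# Minimal sketch: merge the AutoStartStop tag onto every VM in a resource group.
Get-AzVM -ResourceGroupName "MyRG" | ForEach-Object {
    Update-AzTag -ResourceId $_.Id -Tag @{ AutoStartStop = "devserver" } -Operation Merge
}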

πŸ“‚ Step 3: Create the Runbooks

πŸ”— Official docs: Create a runbook in Azure Automation

▢️ Create a New Runbook

  1. In your Automation Account, go to Process Automation > Runbooks.
  2. Click + Create a runbook.
  3. Name it something like Stop-TaggedVMs.
  4. Choose PowerShell as the type.
  5. Paste in the code below (repeat this process for the start runbook later).

πŸ”Ή Runbook Code: Auto-Stop Based on Tags

Param
(    
    [Parameter(Mandatory=$false)][ValidateNotNullOrEmpty()]
    [String]
    $AzureVMName = "All",

    [Parameter(Mandatory=$true)][ValidateNotNullOrEmpty()]
    [String]
    $AzureSubscriptionID = "<your-subscription-id>"
)

try {
    Write-Output "Logging in to Azure..."
    # Authenticate using the system-assigned managed identity of the Automation Account.
    # (If you use a user-assigned identity instead, add -AccountId "<managed-identity-client-id>".)
    Connect-AzAccount -Identity
} catch {
    Write-Error -Message $_.Exception
    throw $_.Exception
}

$TagName  = "AutoStartStop"
$TagValue = "devserver"

Set-AzContext -Subscription $AzureSubscriptionID

if ($AzureVMName -ne "All") {
    $VMs = Get-AzResource -TagName $TagName -TagValue $TagValue | Where-Object {
        $_.ResourceType -like 'Microsoft.Compute/virtualMachines' -and $_.Name -like $AzureVMName
    }
} else {
    $VMs = Get-AzResource -TagName $TagName -TagValue $TagValue | Where-Object {
        $_.ResourceType -like 'Microsoft.Compute/virtualMachines'
    }
}

foreach ($VM in $VMs) {
    Stop-AzVM -ResourceGroupName $VM.ResourceGroupName -Name $VM.Name -Verbose -Force
}

πŸ”— Docs: Connect-AzAccount with Managed Identity

πŸ”Ή Create the Start Runbook

Duplicate the above, replacing Stop-AzVM with Start-AzVM.
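
For clarity, the only functional change in the start runbook is the final loop, sketched below (note that Start-AzVM doesn't take -Force):

# Start runbook: same parameters, login, and tag filtering as above,
# with the final loop switched to Start-AzVM.
foreach ($VM in $VMs) {
    Start-AzVM -ResourceGroupName $VM.ResourceGroupName -Name $VM.Name -Verbose
}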

πŸ”— Docs: Start-AzVM


⏰ Step 4: Create Schedules and Link Them to the Runbooks

πŸ”— Docs: Create schedules in Azure Automation

  1. Go to the Automation Account > Schedules > + Add a schedule.
  2. Create two schedules:
     • DailyStartWeekdays β€” Recurs every weekday at 07:30
     • DailyStopWeekdays β€” Recurs every weekday at 18:30
  3. Go to each runbook > Link to schedule > Choose the matching schedule.

πŸ“Š You can get creative here: separate schedules for dev vs UAT, or different times for different departments.
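
Prefer to script the schedules as well? Here's a minimal sketch using the Az.Automation cmdlets; the account name, resource group, and the Start-TaggedVMs runbook name are assumptions based on the naming used above:

# Minimal sketch: create the weekday schedules and link them to the runbooks.
$rg = "MyResourceGroup"
$aa = "vm-scheduler"
$weekdays = "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"

New-AzAutomationSchedule -ResourceGroupName $rg -AutomationAccountName $aa `
    -Name "DailyStartWeekdays" -StartTime (Get-Date "07:30").AddDays(1) `
    -WeekInterval 1 -DaysOfWeek $weekdays -TimeZone "Europe/London"

New-AzAutomationSchedule -ResourceGroupName $rg -AutomationAccountName $aa `
    -Name "DailyStopWeekdays" -StartTime (Get-Date "18:30").AddDays(1) `
    -WeekInterval 1 -DaysOfWeek $weekdays -TimeZone "Europe/London"

# Link the published runbooks, passing the mandatory subscription ID parameter
Register-AzAutomationScheduledRunbook -ResourceGroupName $rg -AutomationAccountName $aa `
    -RunbookName "Start-TaggedVMs" -ScheduleName "DailyStartWeekdays" `
    -Parameters @{ AzureSubscriptionID = "<your-subscription-id>" }

Register-AzAutomationScheduledRunbook -ResourceGroupName $rg -AutomationAccountName $aa `
    -RunbookName "Stop-TaggedVMs" -ScheduleName "DailyStopWeekdays" `
    -Parameters @{ AzureSubscriptionID = "<your-subscription-id>" }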


πŸ§ͺ Testing Your Runbooks

You can test each runbook directly in the portal:

  • Open the runbook
  • Click Edit > Test Pane
  • Provide test parameters if needed
  • Click Start and monitor output

This is also a good time to validate:

  • The identity has permission
  • The tags are applied correctly
  • The VMs are in a stopped or running state as expected
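
A quick way to check that last point is to query power state by tag, as in this minimal sketch:

# Minimal sketch: confirm the power state of every tagged VM after a test run.
Get-AzVM -Status |
    Where-Object { $_.Tags["AutoStartStop"] -eq "devserver" } |
    Select-Object Name, ResourceGroupName, PowerState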

πŸ“Š The Results

Even this lightweight automation has produced major savings in our environment. Non-prod VMs are now automatically turned off outside office hours, resulting in monthly compute savings of up to 60% without sacrificing availability during working hours.


🧠 Ideas for Further Enhancement

  • Pull tag values from a central config (e.g. Key Vault or Storage Table)
  • Add logic to check for active RDP sessions or Azure Monitor heartbeats
  • Alert via email or Teams on job success/failure
  • Track savings over time and visualize them

πŸ’­ Final Thoughts

If you’re looking for a practical, immediate way to implement FinOps principles in Azure, VM scheduling is a great place to start. With minimal setup and maximum flexibility, custom runbooks give you control without the complexity of the canned solutions.

Have you built something similar or extended this idea further? I’d love to hear about itβ€”drop me a comment or reach out on LinkedIn.

Stay tuned for more FinOps tips coming soon!


πŸ•΅οΈ Replacing SAS Tokens with User Assigned Managed Identity (UAMI) in AzCopy for Blob Uploads

Using Shared Access Signature (SAS) tokens with azcopy is common β€” but rotating tokens and handling them securely can be a hassle. To improve security and simplify our automation, I recently replaced SAS-based authentication in our scheduled AzCopy jobs with Azure User Assigned Managed Identity (UAMI).

In this post, I’ll walk through how to:

  • Replace AzCopy SAS tokens with managed identity authentication
  • Assign the right roles to the UAMI
  • Use azcopy login to authenticate non-interactively
  • Automate the whole process in PowerShell

πŸ” Why Remove SAS Tokens?

SAS tokens are useful, but:

  • πŸ”‘ They’re still secrets β€” and secrets can be leaked
  • πŸ“… They expire β€” which breaks automation when not rotated
  • πŸ” They grant broad access β€” unless scoped very carefully

Managed Identity is a much better approach when the copy job is running from within Azure (like an Azure VM or Automation account).


🌟 Project Goal

Replace the use of SAS tokens in an AzCopy job that uploads files from a local UNC share to Azure Blob Storage β€” by using a User Assigned Managed Identity.


βœ… Prerequisites

To follow along, you’ll need:

  • A User Assigned Managed Identity (UAMI)
  • A Windows Server or Azure VM to run the copy job
  • Access to a local source folder or UNC share (e.g., \\fileserver\data\export\)
  • AzCopy v10.7+ installed on the machine
  • Azure RBAC permissions to assign roles

ℹ️ Check AzCopy Version: Run azcopy --version to ensure you're using v10.7.0 or later, which is required for --identity-client-id support.


πŸ”§ Step-by-Step Setup

πŸ› οΈ Step 1: Create the UAMI

βœ… CLI
az identity create \
  --name my-azcopy-uami \
  --resource-group my-resource-group \
  --location <region>
βœ… Portal
  1. Go to Managed Identities in the Azure Portal
  2. Click + Create and follow the wizard

πŸ–‡οΈ Step 2: Assign the UAMI to the Azure VM

AzCopy running on a VM must be able to assume the identity. Assign the UAMI to your VM:

βœ… CLI
az vm identity assign \
  --name my-vm-name \
  --resource-group my-resource-group \
  --identities my-azcopy-uami
βœ… Portal
  1. Navigate to the Virtual Machines blade
  2. Select the VM running your AzCopy script
  3. Under Settings, click Identity
  4. Go to the User assigned tab
  5. Click + Add, select your UAMI, then click Add

πŸ” Step 3: Assign RBAC Permissions to UAMI

For AzCopy to function correctly with a UAMI, the following role assignments are recommended:

  • Storage Blob Data Contributor: Required for read/write blob operations
  • Storage Blob Data Reader: (Optional) For read-only scenarios or validation scripts
  • Reader: (Optional) For browsing or metadata-only permissions on the storage account

⏳ RBAC Tip: It may take up to 5 minutes for role assignments to propagate fully. If access fails initially, wait and retry.

βœ… CLI
az role assignment create \
  --assignee <client-id-or-object-id> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>/blobServices/default/containers/<container-name>"

az role assignment create \
  --assignee <client-id-or-object-id> \
  --role "Storage Blob Data Reader" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>"

az role assignment create \
  --assignee <client-id-or-object-id> \
  --role "Reader" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
βœ… Portal
  1. Go to your Storage Account in the Azure Portal
  2. Click on the relevant container (or stay at the account level for broader scope)
  3. Open Access Control (IAM)
  4. Click + Add role assignment
  5. Repeat this for each role:
     • Select Storage Blob Data Contributor, assign to your UAMI, and click Save
     • Select Storage Blob Data Reader, assign to your UAMI, and click Save
     • Select Reader, assign to your UAMI, and click Save

πŸ§ͺ Step 4: Test AzCopy Login Using UAMI

$clientId = "<your-uami-client-id>"
& "C:\azcopy\azcopy.exe" login --identity --identity-client-id $clientId

You should see a confirmation message that AzCopy has successfully logged in.

πŸ” To verify AzCopy is authenticated with the correct identity, you can run:

azcopy env

This lists the environment variables AzCopy reads (and their current values), which can help confirm how authentication is being sourced, for example whether a managed identity auto-login type is configured.


πŸ“ Step 5: Upload Files Using AzCopy + UAMI

Here's the PowerShell script that copies all files from a local share to the Blob container:

$clientId = "<your-uami-client-id>"

# Login with Managed Identity
& "C:\azcopy\azcopy.exe" login --identity --identity-client-id $clientId

# Run the copy job
& "C:\azcopy\azcopy.exe" copy \
  "\\\\fileserver\\data\\export\\" \
  "https://<your-storage-account>.blob.core.windows.net/<container-name>" \
  --overwrite=true \
  --from-to=LocalBlob \
  --blob-type=Detect \
  --put-md5 \
  --recursive \
  --log-level=INFO

πŸ’‘ UNC Note: Double backslashes are used in PowerShell to represent UNC paths properly.

This script can be scheduled using Task Scheduler or run on demand.


⏱️ Automate with Task Scheduler (Optional)

To automate the job:

  1. Open Task Scheduler on your VM
  2. Create a New Task (not a Basic Task)
  3. Under General, select "Run whether user is logged on or not"
  4. Under Actions, add a new action to run powershell.exe
  5. Set the arguments to point to your .ps1 script
  6. Ensure the AzCopy path is hardcoded in your script
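
If you'd rather script the task registration, here's a minimal sketch using the ScheduledTasks cmdlets; the task name, script path, and run time are placeholders:

# Minimal sketch: register the upload script as a daily scheduled task running as SYSTEM.
$action = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -ExecutionPolicy Bypass -File C:\scripts\Upload-ToBlob.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At "22:00"

Register-ScheduledTask -TaskName "AzCopy-Blob-Upload" `
    -Action $action -Trigger $trigger `
    -User "SYSTEM" -RunLevel Highest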

πŸš‘ Troubleshooting Common Errors

❌ 403 AuthorizationPermissionMismatch

  • Usually means the identity doesn’t have the correct role or the role hasn’t propagated yet
  • Double-check:
     • The UAMI is assigned to the VM
     • The UAMI has Storage Blob Data Contributor on the correct container
  • Wait 2–5 minutes and try again

❌ azcopy : The term 'azcopy' is not recognized

  • AzCopy is not in the system PATH
  • Solution: Use the full path to azcopy.exe, like C:\azcopy\azcopy.exe

πŸ›‘οΈ Benefits of Switching to UAMI

  • βœ… No secrets or keys stored on disk
  • βœ… No manual token expiry issues
  • βœ… Access controlled via Azure RBAC
  • βœ… Easily scoped and auditable

🧼 Final Thoughts

Replacing AzCopy SAS tokens with UAMI is one of those small wins that pays dividends over time. Once set up, it's secure, robust, and hands-off.

Let me know if you'd like a variant of this that works from Azure Automation or a hybrid worker!



Replacing SQL Credentials with User Assigned Managed Identity (UAMI) in Azure SQL Managed Instance

Storing SQL usernames and passwords in application configuration files is still common practice β€” but it poses a significant security risk. As part of improving our cloud security posture, I recently completed a project to eliminate plain text credentials from our app connection strings by switching to Azure User Assigned Managed Identity (UAMI) authentication for our SQL Managed Instance.

In this post, I’ll walk through how to:

  • Securely connect to Azure SQL Managed Instance without using usernames or passwords
  • Use a User Assigned Managed Identity (UAMI) for authentication
  • Test this connection using the new Go-based sqlcmd CLI
  • Update real application code to remove SQL credentials

πŸ” Why Replace SQL Credentials?

Hardcoded SQL credentials come with several downsides:

  • Security risk: Stored secrets can be compromised if not properly secured
  • Maintenance overhead: Rotating passwords across environments is cumbersome
  • Audit concerns: Plain text credentials often trigger compliance red flags

Azure Managed Identity solves this by providing a token-based, identity-first way to connect to services β€” no secrets required.


βš™οΈ What is a User Assigned Managed Identity?

There are two types of Managed Identities in Azure:

  • System-assigned: Tied to the lifecycle of a specific resource (like a VM or App Service)
  • User-assigned: Standalone identity that can be attached to one or more resources

For this project, we used a User Assigned Managed Identity (UAMI) to allow our applications to authenticate against SQL without managing secrets.


🌟 Project Objective

Replace plain text SQL credentials in application connection strings with User Assigned Managed Identity (UAMI) for secure, best-practice authentication to Azure SQL Managed Instances.


βœ… Prerequisites

To follow this guide, you’ll need:

  • An Azure SQL Managed Instance with Microsoft Entra (AAD) authentication enabled
  • A User Assigned Managed Identity (UAMI)
  • An Azure VM or App Service to host your app (or test client)
  • The Go-based sqlcmd CLI installed
    β†’ Install guide

πŸ”§ Setting Up the User Assigned Managed Identity (UAMI)

Before connecting to Azure SQL using UAMI, ensure the following steps are completed:

  • Create the UAMI
  • Assign the UAMI to the Virtual Machine(s)
  • Configure Microsoft Entra authentication on the SQL Managed Instance
  • Grant SQL access to the UAMI

These steps can be completed via Azure CLI, PowerShell, or the Azure Portal.


πŸ› οΈ Step 1: Create the User Assigned Managed Identity (UAMI)

βœ… CLI
az identity create \
  --name my-sql-uami \
  --resource-group my-rg \
  --location <region>

Save the Client ID and Object ID β€” you’ll need them later.

βœ… Portal
  1. Go to Azure Portal β†’ Search Managed Identities
  2. Click + Create
  3. Choose Subscription, Resource Group, and Region
  4. Name the identity (e.g., my-sql-uami)
  5. Click Review + Create
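
If you prefer PowerShell for this step, here's a minimal sketch using the Az.ManagedServiceIdentity module; the names are placeholders:

# Minimal sketch: create the UAMI and capture the IDs you'll need later.
$uami = New-AzUserAssignedIdentity -ResourceGroupName "my-rg" -Name "my-sql-uami" -Location "<region>"

$uami.ClientId      # used later in connection strings and sqlcmd
$uami.PrincipalId   # used later for role assignments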

πŸ–‡οΈ Step 2: Assign the UAMI to a Virtual Machine

Attach the UAMI to:

  • The VM(s) running your application code
  • The VM used to test the connection
βœ… CLI
az vm identity assign \
  --name my-vm-name \
  --resource-group my-rg \
  --identities my-sql-uami
βœ… Portal
  1. Go to Virtual Machines β†’ Select your VM
  2. Click Identity under Settings
  3. Go to the User assigned tab
  4. Click + Add β†’ Select the UAMI
  5. Click Add

πŸ”‘ Step 3: Configure SQL Managed Instance for Microsoft Entra Authentication

  1. Set an Entra Admin:
     • Go to your SQL MI β†’ Azure AD admin blade
     • Click Set admin and choose a user or group
     • Save changes

  2. Ensure Directory Reader permissions:
     • Your SQL MI’s managed identity needs Directory Reader access
     • You can assign this role via Entra ID > Roles and administrators > Directory Readers

More details: Configure Entra authentication


πŸ“œ Step 4: (Optional) Assign Azure Role to the UAMI

This may be needed if the identity needs to access Azure resource metadata or use Azure CLI from the VM.

βœ… CLI
az role assignment create \
  --assignee-object-id <uami-object-id> \
  --role "Reader" \
  --scope /subscriptions/<sub-id>/resourceGroups/<rg-name>
βœ… Portal
  1. Go to the UAMI β†’ Azure role assignments
  2. Click + Add role assignment
  3. Choose role (e.g., Reader)
  4. Set scope
  5. Click Save

πŸ”‘ Step 5: Grant SQL Access to the UAMI

Once the UAMI is assigned to the VM and Entra auth is enabled on SQL MI, log in with an admin and run:

CREATE USER [<client-id>] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [<client-id>];
ALTER ROLE db_datawriter ADD MEMBER [<client-id>];

Or use a friendly name:

CREATE USER [my-app-identity] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [my-app-identity];

πŸ§ͺ Step 6: Test the Connection Using sqlcmd

sqlcmd \
  -S <your-sql-mi>.database.windows.net \
  -d <database-name> \
  --authentication-method ActiveDirectoryManagedIdentity \
  -U <client-id-of-uami>

If successful, you’ll see the 1> prompt where you can execute SQL queries.


πŸ“Š Step 7: Update Application Code

Update your app to use the UAMI for authentication.

Example connection string for UAMI in C#:

string connectionString = @"Server=tcp:<your-sql-mi>.database.windows.net;" +
                          "Authentication=Active Directory Managed Identity;" +
                          "Encrypt=True;" +
                          "User Id=<your-uami-client-id>;" +
                          "Database=<your-db-name>;";

Make sure your code uses Microsoft.Data.SqlClient with AAD token support.

Or retrieve and assign the token programmatically:

// Target the UAMI explicitly and request a token scoped to Azure SQL
var credential = new DefaultAzureCredential(new DefaultAzureCredentialOptions
{
    ManagedIdentityClientId = "<your-uami-client-id>"
});
var token = await credential.GetTokenAsync(new TokenRequestContext(
    new[] { "https://database.windows.net/.default" }));

var connection = new SqlConnection("Server=<your-sql-mi>; Database=<your-db-name>; Encrypt=True;");
connection.AccessToken = token.Token;

πŸ”’ Security Benefits

  • πŸ” No credentials stored
  • πŸ” No password rotation
  • πŸ›‘οΈ Entra-integrated access control and auditing

βœ… Summary

By switching to User Assigned Managed Identity, we removed credentials from connection strings and aligned SQL access with best practices for cloud identity and security.

Comments and feedback welcome!


πŸ“Š Monitoring an IIS-Based Web Farm with Azure Application Insights

In this guide, you'll learn how to:

βœ… Set up Application Insights on an IIS-based web farm.
βœ… Configure Log Analytics, Data Collection Rules, and Data Collection Endpoints.
βœ… Use PowerShell to install the Application Insights agent.
βœ… Monitor live metrics, failures, performance, and logs in real-time.

By the end, you'll have a fully monitored IIS-based web farm using Azure! 🎯


πŸ—οΈ Step 1: Enabling Application Insights on IIS Servers

To effectively monitor your IIS-based application, you need to configure Azure Application Insights and ensure all required components are installed on your Azure VMs.

πŸ› οΈ Prerequisites

Before proceeding, ensure you have:

  • An active Azure Subscription with permissions to create and manage resources.
  • A Log Analytics Workspace (LAW) to store collected telemetry data.
  • Azure Monitor Agent (AMA) installed on your IIS VMs.
  • Necessary permissions to create Data Collection Rules (DCRs) and Data Collection Endpoints (DCEs).

Create a Log Analytics Workspace

  1. Go to Azure Portal β†’ Search for "Log Analytics Workspaces" β†’ Create.
  2. Provide the following details:
     • Subscription: Select your Azure subscription.
     • Resource Group: Choose or create a new one.
     • Name: Enter a unique name (e.g., log-corpapp-prod-uksouth).
     • Region: Same as your IIS VMs.
  3. Click "Review + Create" and deploy the workspace.

πŸ”— Microsoft Learn: Log Analytics Workspace
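
If you prefer PowerShell over the portal here, a minimal sketch using the Az.OperationalInsights module; the resource group name is a placeholder:

# Minimal sketch: create the Log Analytics Workspace with PowerShell.
New-AzOperationalInsightsWorkspace `
    -ResourceGroupName "rg-corpapp-prod-uksouth" `
    -Name "log-corpapp-prod-uksouth" `
    -Location "uksouth" `
    -Sku "PerGB2018"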

Create a Data Collection Endpoint (DCE)

  1. Navigate to Monitor β†’ Data Collection Endpoints.
  2. Click "+ Create" and provide:
     • Name: e.g., dce-corpapp-prod-uksouth.
     • Subscription & Resource Group: Same as your IIS VMs.
     • Region: Same as Log Analytics Workspace.
  3. Review & create the endpoint.

πŸ”— Microsoft Learn: Data Collection Endpoints

Create a Data Collection Rule (DCR)

  1. Go to Monitor β†’ Data Collection Rules β†’ + Create.
  2. Configure:
     • Name: dcr-corpapp-iis-prod-uksouth
     • Subscription & Resource Group: Same as above.
     • Region: Same as DCE & LAW.
  3. Define data sources:
     • Windows Event Logs: Add System, Application, etc.
     • Log Levels: Select relevant levels (Error, Warning, Information).
  4. Set Destination:
     • Choose "Log Analytics Workspace" β†’ Select the previously created workspace.
  5. Associate with IIS VMs (WEB01 - WEB05).
  6. Review & Create the rule.

πŸ”— Microsoft Learn: Data Collection Rules

Install the Azure Monitor Agent (AMA)

  1. Navigate to each IIS VM.
  2. Under "Monitoring", select "Extensions".
  3. Click "+ Add" β†’ AzureMonitorWindowsAgent β†’ Install.
  4. Repeat for all IIS VMs.

πŸ”— Microsoft Learn: Azure Monitor Agent
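
To avoid clicking through five VMs, here's a minimal PowerShell sketch that installs the agent extension on each server; the resource group name is a placeholder:

# Minimal sketch: install the Azure Monitor Agent extension on each IIS VM.
$vms = "WEB01", "WEB02", "WEB03", "WEB04", "WEB05"

foreach ($vmName in $vms) {
    Set-AzVMExtension `
        -ResourceGroupName "rg-corpapp-prod-uksouth" `
        -VMName $vmName `
        -Name "AzureMonitorWindowsAgent" `
        -Publisher "Microsoft.Azure.Monitor" `
        -ExtensionType "AzureMonitorWindowsAgent" `
        -TypeHandlerVersion "1.0" `
        -EnableAutomaticUpgrade $true
}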

Enable Application Insights

  1. Navigate to Azure Portal β†’ Search for "Application Insights".
  2. Click "+ Create" β†’ Provide:
     • Subscription & Resource Group: Same as VMs.
     • Name: insights-corpapp-prod-uksouth-001.
     • Region: Same as your IIS VMs.
     • Application Type: ASP.NET Web Application.
  3. Click "Review + Create" and deploy.

πŸ”— Microsoft Learn: Enable Application Insights

Install the Application Insights Agent

Use the following PowerShell script to install the agent on all of the IIS servers:

# Install the Application Insights Agent (the Az.ApplicationMonitor module from the PowerShell Gallery)
$instrumentationKey = "YOUR-INSTRUMENTATION-KEY"
Install-PackageProvider -Name NuGet -Force
Install-Module -Name Az.ApplicationMonitor -Force
# A connection string can be used instead of the instrumentation key via -ConnectionString
Enable-ApplicationInsightsMonitoring -InstrumentationKey $instrumentationKey
Restart-Service W3SVC

πŸ“Š Step 2: Using Application Insights for Monitoring

With everything set up, it's time to monitor and analyze application performance! πŸ”

πŸ“Œ Overview Dashboard

  • Displays high-level health metrics, failed requests, and response times. πŸ“Έ Insights Overview

πŸ“Œ Application Map

  • Shows dependencies and interactions between components. πŸ“Έ Application Map

πŸ“Œ Live Metrics

  • Monitor real-time requests, server performance, and failures. πŸ“Έ Live Metrics

πŸ“Œ Failures & Exceptions

  • Identify and diagnose failed requests & top exceptions. πŸ“Έ Failures & Exceptions

πŸ“Œ Performance Monitoring

  • Analyze response times, dependencies & bottlenecks. πŸ“Έ Performance Overview

πŸ“Œ Logs & Queries

  • Run Kusto Query Language (KQL) queries for deep insights.

Example query to find failed requests:

requests
| where timestamp > ago(24h)
| where success == false
| project timestamp, name, resultCode, url
| order by timestamp desc

πŸ“Έ Query Results


βœ… Next Steps

🎯 Continue monitoring logs & alerts for trends.
🎯 Optimize Application Insights sampling to reduce telemetry costs.
🎯 Automate reporting for key performance metrics.

By following this guide, you'll have a robust, real-time monitoring setup for your IIS web farm, ensuring optimal performance and quick issue resolution! πŸš€
