🔥 Vibe Coding My Way to AI Connected Infra: Claude, Terraform & Cloud-Native Monitoring

📖 TL;DR – What This Post Covers

  • How I used AI tools to build an Azure-based monitoring solution from scratch
  • Lessons learned from developing two full versions (manual vs. Terraform)
  • The good, bad, and wandering of GenAI for infrastructure engineers
  • A working, cost-effective, and fully redeployable AI monitoring stack

Introduction

This project began, as many of mine do, with a career planning conversation. During a discussion with ChatGPT about professional development and emerging skill areas for 2025, one suggestion stuck with me:

"You should become an Infrastructure AI Integration Engineer."

It’s a role that doesn’t really exist yet — but probably should.

What followed was a journey to explore whether such a role could be real. I set out to build an AI-powered infrastructure monitoring solution in Azure, without any formal development background and using nothing but conversations with Claude. This wasn’t just about building something cool — it was about testing whether a seasoned infra engineer could:

  • Use GenAI to design and deploy a full solution
  • Embrace the unknown and lean into the chaos of LLM-based workflows
  • Create something reusable, repeatable, and useful

The first phase of the journey was a local prototype using my Pi5 and n8n for AI workflow automation (see my previous post for that). It worked — but it was local, limited, and not exactly enterprise-ready.

So began the cloud migration.

Why this project mattered

I had two goals:

  • ✅ Prove that “vibe coding” — using GenAI with limited pre-planning — could produce something deployable
  • ✅ Build a portfolio project for the emerging intersection of AI and infrastructure engineering

This isn’t a tutorial on AI monitoring. Instead, it’s a behind-the-scenes look at what happens when you try to:

  • Build something real using AI chat alone
  • Translate a messy, manual deployment into clean Infrastructure as Code
  • Learn with the AI, not just from it

The Terraform modules prove it works.
The chat logs show how we got there.
The dashboard screenshots demonstrate the outcome.

The next sections cover the journey in two parts: first, the vibe-coded v1; then the Terraform-powered refactor in v2.


Part 1: The Prototype

Stage 1 – Manual AI-Assisted Deployment: The Birth of a Vibe-Coded Project

The project didn’t start with a business requirement — it started with curiosity. One evening, mid-career reflection turned into a late-night conversation with ChatGPT:

"You should become an Infrastructure AI Integration Engineer."

I’d never heard the term, but it sparked something. With 20+ years in IT infrastructure and the growing presence of AI in tooling, it felt like a direction worth exploring.

The Thought Experiment

Could I — an infrastructure engineer, not a dev — build an AI-driven cloud monitoring solution:

  • End-to-end, using only AI assistance
  • Without dictating the architecture
  • With minimal manual planning

The rules were simple:

  • ❌ No specifying what resources to use
  • ❌ No formal design documents
  • ✅ Just tell the AI the outcome I wanted, and let it choose the path

The result: pure "vibe coding." Or as I now call it: AI Slop-Ops.

What is Vibe Coding (a.k.a. Slop-Ops)?

For this project, "vibe coding" meant:

  • 🤖 Generating all infrastructure and app code using natural language prompts
  • 🧠 Letting Claude decide how to structure everything
  • 🪵 Learning through experimentation and iteration

My starting prompt was something like:
"I want to build an AI monitoring solution in Azure that uses Azure OpenAI to analyze infrastructure metrics."

Claude replied:

"Let’s start with a simple architecture: Azure Container Apps for the frontend, Azure Functions for the AI processing, and Azure OpenAI for the intelligence. We'll build it in phases."

That one sentence kicked off a 4–5 week journey involving:

  • ~40–50 hours of evening and weekend effort 🧵
  • Dozens of chats, scripts, and browser tabs
  • An unpredictable mix of brilliance and bafflement

And the whole thing started to work.


Version 1: The Manual Deployment Marathon

The first build was fully manual — a mix of PowerShell scripts, Azure portal clicks, and Claude-prompting marathons. Claude suggested a phased approach, which turned out to be the only way to keep it manageable.

💬 Claude liked PowerShell. I honestly can’t remember if that was my idea or if I just went along with it. 🤷‍♂️


Platform and GenAI Choices

🌐 Why Azure?

The platform decision was pragmatic:

  • I already had a Visual Studio Developer Subscription with £120/month of Azure credits.
  • Azure is the cloud provider I work with day-to-day, so it made sense to double down.
  • Using Azure OpenAI gave me hands-on experience with Azure AI Foundry – increasingly relevant in modern infrastructure roles.

In short: low cost, high familiarity, and useful upskilling.


🧠 Why Claude?

This project was built almost entirely through chat with Claude, Anthropic’s conversational AI. I’ve found:

  • Claude is better at structured technical responses, particularly with IaC and shell scripting.
  • ChatGPT, in my experience, hallucinates more often when writing infrastructure code.

But Claude had its own quirks too:

  • No memory between chats — every session required reloading context.
  • Occasional focus issues — drifting from task or overcomplicating simple requests.
  • Tendency to suggest hardcoded values when debugging — needing constant vigilance to maintain DRY principles.

⚠️ Reality check: Claude isn't a Terraform expert. It's a language model that guesses well based on input. The human still needs to guide architecture, validate outputs, and ensure everything actually works.


🤖 Prompt Engineering Principles

I used a consistent framework to keep Claude focused and productive:

  • ROLE: Define Claude's purpose (e.g., “You are a Terraform expert”)
  • INPUT: What files or context is provided
  • OUTPUT: What should Claude return (e.g., a module, refactored block, explanation)
  • CONSTRAINTS: e.g., “No hardcoded values”, “Use locals not repeated variables”
  • TASK: Specific action or generation requested
  • REMINDERS: Extra nudges — “Use comments”, “Output in markdown”, “Use Azure CLI not PowerShell”

This approach reduced misunderstandings and helped prevent “solution drift” during long iterative sessions.


🧱 Phase 1: Foundation

This first phase set up the core infrastructure that everything else would build upon.

🔧 What Got Built

  • Resource Groups – Logical container for resources
  • Storage Accounts – Persistent storage for logs, state, and AI interaction data
  • Log Analytics Workspace – Centralized logging for observability
  • Application Insights – Telemetry and performance monitoring for apps

These services created the backbone of the environment, enabling both operational and analytical insight.


🖥️ PowerShell Verification Script

This example script was used during v1 to manually verify deployment success:

# Verify everything is working
Write-Host "🔍 Verifying Step 1.1 completion..." -ForegroundColor Yellow

# Check resource group
$rg = Get-AzResourceGroup -Name "rg-ai-monitoring-dev" -ErrorAction SilentlyContinue
if ($rg) {
    Write-Host "✅ Resource Group exists" -ForegroundColor Green
} else {
    Write-Host "❌ Resource Group not found" -ForegroundColor Red
}

# Check workspace
$ws = Get-AzOperationalInsightsWorkspace -ResourceGroupName "rg-ai-monitoring-dev" -Name "law-ai-monitoring-dev" -ErrorAction SilentlyContinue
if ($ws -and $ws.ProvisioningState -eq "Succeeded") {
    Write-Host "✅ Log Analytics Workspace is ready" -ForegroundColor Green
} else {
    Write-Host "❌ Log Analytics Workspace not ready. State: $($ws.ProvisioningState)" -ForegroundColor Red
}

# Check config file
if (Test-Path ".\phase1-step1-config.json") {
    Write-Host "✅ Configuration file created" -ForegroundColor Green
} else {
    Write-Host "❌ Configuration file missing" -ForegroundColor Red
}

🧠 Phase 2: Intelligence Layer

With the foundation in place, the next step was to add the brainpower — the AI and automation components that turn infrastructure data into actionable insights.

🧩 Key Components

  • Azure OpenAI Service
      • Deployed with gpt-4o-mini to balance cost and performance
      • Powers the natural language analysis and recommendation engine

  • Azure Function App
      • Hosts the core AI processing logic
      • Parses data from monitoring tools and feeds it to OpenAI
      • Returns interpreted insights in a format suitable for dashboards and alerts

  • Logic Apps
      • Automates data ingestion and flow between services
      • Orchestrates the processing of logs, telemetry, and alert conditions
      • Acts as glue between Function Apps, OpenAI, and supporting services

🗣️ AI Integration Philosophy

This stage wasn’t about building complex AI logic — it was about using OpenAI to interpret patterns in infrastructure data and return intelligent summaries or recommendations in natural language.

Example prompt fed to OpenAI from within a Function App:

“Based on this log stream, are there any signs of service degradation or performance issues in the last 15 minutes?”

The response would be embedded in a monitoring dashboard or sent via alert workflows, giving human-readable insights without manual interpretation.
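For illustration, here is a minimal sketch of the kind of chat completions request the Function App makes against Azure OpenAI. The resource name, deployment name, API version, key variable and log excerpt are placeholders, not the project's actual values:

# Minimal sketch: the style of chat completions call made from the Function App.
# Resource name, deployment name, API version and key are placeholders.
curl -s "https://<your-resource>.openai.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-02-15-preview" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_KEY" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are an infrastructure monitoring assistant."},
      {"role": "user", "content": "Based on this log stream, are there any signs of service degradation or performance issues in the last 15 minutes? <log excerpt>"}
    ],
    "max_tokens": 300,
    "temperature": 0.3
  }'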


⚙️ Why This Setup?

Each component in this layer was chosen for a specific reason:

  • OpenAI for flexible, contextual intelligence
  • Function Apps for scalable, event-driven execution
  • Logic Apps for orchestration without writing custom backend code

This approach removed the need for always-on VMs or bespoke integrations — and kept things lean.


📌 By the end of Phase 2, the system had a functioning AI backend that could interpret infrastructure metrics in plain English and respond in near real-time.

🎨 Phase 3: The User Experience

With the core infrastructure and AI processing in place, it was time to build the frontend — the visible interface for users to interact with the AI-powered monitoring system.

This phase focused on deploying a set of containerized applications, each responsible for a specific role in the monitoring workflow.


🧱 Components Deployed

The solution was built around Azure Container Apps, with a four-container ecosystem designed to work in harmony:

  • FastAPI Backend
    Handles API requests, routes data to the correct services, and acts as the core orchestrator behind the scenes.

  • React Dashboard
    A clean, responsive frontend displaying infrastructure metrics, system health, and AI-generated insights.

  • Background Processor
    Continuously monitors incoming data and triggers AI evaluations when certain thresholds or patterns are detected.

  • Load Generator
    Provides synthetic traffic and test metrics to simulate real usage patterns and help validate system behavior.


🔄 Why This Architecture?

Each container serves a focused purpose, allowing for:

  • Isolation of concerns — easier debugging and development
  • Scalable deployment — each component scales independently
  • Separation of UI and logic — keeping the AI and logic layers decoupled from the frontend

“Claude recommended this separation early on — the decision to use Container Apps instead of AKS or App Services kept costs down and complexity low, while still providing a modern cloud-native experience.”


⚙️ Deployment Highlights

Container Apps were provisioned via CLI in the manual version, and later through Terraform in v2. The deployment process involved:

  • Registering a Container Apps Environment
  • Creating the four separate app containers
  • Passing environment variables for API endpoints, keys, and settings
  • Enabling diagnostics and logging via Application Insights

# Example (abridged): creating the dashboard container app. The resource group,
# Container Apps environment and registry credentials are omitted for brevity.
az containerapp create \
  --name react-dashboard \
  --image myregistry.azurecr.io/dashboard:latest \
  --env-vars REACT_APP_API_URL=https://api.example.com

📊 Final Result

Once deployed, the user-facing layer provided:

  • 🔍 Real-time visual metrics
  • 💡 AI-generated recommendations
  • 🧠 Interactive analysis via chat
  • 📈 Infrastructure performance summaries
  • 💬 Stakeholder-friendly reporting

This phase brought the system to life — from backend AI logic to a polished, interactive dashboard.

🤖 The Reality of AI-Assisted Development

Here's what the success story doesn’t capture: the relentless battles with Claude’s limitations.

Despite its capabilities, working with GenAI in a complex, multi-phase project revealed real friction points — especially when continuity and context were critical.

😫 Daily Frustrations Included
  • 🧱 Hitting chat length limits daily — even with Claude Pro
  • 🧭 AI meandering off-topic, despite carefully structured prompts
  • 📚 Over-analysis — asking for one thing and receiving a detailed architectural breakdown
  • ⚙️ Token burn during troubleshooting — Claude often provided five-step fixes when a one-liner was needed
  • 💾 No persistent memory or project history, meaning I had to manually copy/paste prior chats into a .txt file just to refeed them back in
  • 🔁 Starting new chats daily — and re-establishing context from scratch every time
  • 📏 Scope creep — Claude regularly expanded simple requests into full architectural reviews without being asked

Despite these pain points, the experience was still a net positive — but only because I was prepared to steer the conversation firmly and frequently.

Chat length limit warning

🧪 From Real-World Troubleshooting

Sometimes, working with Claude felt like pair programming with a colleague who had perfect recall — until they completely wiped their memory overnight.

🧵 From an actual troubleshooting session:

“The dashboard is calling the wrong function URL again.
It’s trying to reach func-tf-ai-monitoring-dev-ai,
but the actual function is at func-ai-monitoring-dev-ask6868-ai.”

It was a recurring theme: great memory during a session, zero continuity the next day.

Me: “Right, shall we pick up where we left off yesterday then?”
Claude: “I literally have no idea what you're talking about, mate.”
Claude: “Wait, who are you again?”

Every failure taught both me and Claude something — but the learning curve was steep, and the iteration cycles could be genuinely exhausting.

Version 1 - Deployed & Working

AI Monitoring Dashboard V1

🧠 What I Learned from Part 1

Reflecting on the first phase of this project — the manual, vibe-coded deployment — several key takeaways emerged.

✅ What Worked Well
  • ⚡ Rapid prototyping — quickly turned ideas into functioning infrastructure
  • 💬 Natural language problem-solving — great for tackling Azure’s complex service interactions
  • 🧾 Syntactically sound code generation — most outputs worked with minimal tweaks
  • ⏱️ Massive time savings — tasks that might take days manually were completed in hours

🔍 What Needed Constant Oversight
  • 🧠 Keeping the AI focused — drift and distraction were constant threats
  • 🔗 Managing dependencies and naming — conflicts and collisions needed manual intervention
  • 🐛 Debugging runtime issues — particularly frustrating when errors only manifested in Azure
  • 🧭 Architectural decisions — strategic direction still had to come from me
  • ⚠️ Knowing when “it works” wasn’t “production-ready” — validation remained a human job

🛠️ Language & Tooling Choices

Interestingly, Claude dictated the stack more than I did.

  • Version 1 leaned heavily on PowerShell
  • Version 2 shifted to Azure CLI and Bash

Despite years of experience with PowerShell, I found Claude was significantly more confident (and accurate) when generating Azure CLI or Bash-based commands. This influenced the eventual choice to move away from PowerShell in the second iteration.


By the end of Part 1, I had a functional AI monitoring solution — but it was fragile, inconsistent, and impossible to redeploy without repeating all the manual steps.

That realisation led directly to Version 2 — a full rebuild using Infrastructure as Code.


🌍 Part 2: Why Terraform? Why Now?

After several weeks of manual deployments, the limitations of version 1 became unmissable.

Yes — the system worked — but only just:

  • Scripts were fragmented and inconsistent
  • Fixes required custom, ad-hoc scripts created on the fly
  • Dependencies weren’t tracked, and naming conflicts crept in
  • Reproducibility? Practically zero

🚨 The deployment process had become unwieldy — a sprawl of folders, partial fixes, and manual interventions. Functional? Sure. Maintainable? Absolutely not.


That’s when the Infrastructure as Code (IaC) mindset kicked in.

“Anything worth building once is worth building repeatably.”

The question was simple:
💡 Could I rebuild everything from scratch — but this time, using AI assistance to create clean, modular, production-ready Terraform code?


🧱 The Terraform Challenge

Rebuilding in Terraform wasn’t just a choice of tooling — it was a challenge to see how far AI-assisted development could go when held to production-level standards.

🎯 Goals of the Terraform Rewrite
  • Modularity
    Break down the monolithic structure into reusable, isolated modules
  • Portability
    Enable consistent deployment across environments and subscriptions
  • DRY Principles
    Absolutely no hardcoded values or duplicate code
  • Documentation
    Ensure the code was clear, self-documenting, and reusable by others

Terraform wasn’t just a tech choice — it became the refinement phase.
A chance to take what I’d learned from the vibe-coded version and bake that insight into clean, structured infrastructure-as-code.

Next: how AI and I tackled that rebuild, and the (sometimes surprising) choices we made.

🧠 The Structured Prompt Approach

The prompt engineering approach became absolutely crucial during the Terraform refactoring phase.

Rather than relying on vague questions or “do what I mean” instructions, I adopted a structured briefing style — the kind you might use when assigning work to a consultant:

  • Define the role
  • Set the goals
  • Describe the inputs
  • Outline the method
  • Impose constraints

Here’s the actual instruction prompt I used to initiate the Terraform rebuild 👇

🔧 Enhanced Prompt: AI Monitoring Solution IaC Refactoring Project

👤 Role Definition
You are acting as:
 An Infrastructure as Code (IaC) specialist with deep expertise in Terraform
 An AI integration engineer, experienced in deploying Azure-based AI workloads

Your responsibilities are:
 To refactor an existing AI Monitoring solution from a manually built prototype 
  into a modular, efficient, and portable Terraform project
 To minimize bloat, ensure code reusability, and produce clear documentation 
  to allow redeployment with minimal changes

🎯 Project Goals
 Rebuild the existing AI Monitoring solution as a fully modular, DRY-compliant 
  Terraform deployment
 Modularize resources (OpenAI, Function Apps, Logic Apps, Container Apps) 
  into reusable components
 Provide clear, concise README.md files for each module describing usage, 
  input/output variables, and deployment steps

📁 Project Artifacts (Input)
The following components are part of the original Azure-hosted AI Monitoring solution:
 Azure OpenAI service
 Azure Function App
 Logic App
 Web Dashboard
 Container Apps Environment
 Supporting components (Key Vaults, App Insights, Storage, etc.)

🛠️ Approach / Methodology
For each module:
 Use minimal but complete resource blocks
 Include only essential variables with sensible defaults
 Use output values to export key resource properties
 Follow DRY principles using locals or reusable variables where possible

📌 Additional Guidelines
 Efficiency first: Avoid code repetition; prefer reusability, locals, and input variables
 Practical defaults: Pre-fill variables with production-safe, but general-purpose values
 Keep it modular: No monolithic deployment blocks—use modules for all core resources
 Strict adherence: Do not expand scope unless confirmed

This structured approach helped maintain focus and provided clear boundaries for the AI to work within — though, as you'll see, constant reinforcement was still required throughout the process.


🔄 The Refactoring Process

The Terraform rebuild became a different kind of AI collaboration.

Where version 1 was about vibing ideas into reality, version 2 was about methodically translating a messy prototype into clean, modular, production-friendly code.


🧩 Key Modules Created
  • foundation
    Core infrastructure — resource groups, storage accounts, logging, etc.

  • openai
    Azure OpenAI resource and model deployment — central to the intelligent analysis pipeline

  • function-app
    Azure Functions for AI processing — connecting telemetry with insights

  • container-apps
    Four-container ecosystem — the user-facing UI and visualization layers

  • monitoring
    Application Insights + alerting — keeping the system observable and maintainable


📁 Modular Structure Overview
terraform-ai-monitoring/
├── modules/
│   ├── foundation/
│   ├── openai/
│   ├── function-app/
│   ├── container-apps/
│   └── monitoring/
├── main.tf
└── terraform.tfvars

Each module went through multiple refinement cycles. The goal wasn’t just to get it working — it was to ensure:

  • Clean, reusable Terraform code
  • Explicit configuration
  • DRY principles throughout
  • Reproducible, idempotent deployments
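A quick validation loop helped keep those refinement cycles honest. This is a sketch, not the project's exact workflow; the plan file name is illustrative:

# Sketch of a per-module validation loop
terraform fmt -recursive -check      # flag formatting drift across modules
terraform validate                   # catch schema and reference errors early
terraform plan -out=tfplan           # review the proposed change set
terraform apply tfplan               # apply exactly what was reviewed

# Re-running plan straight after apply should report no changes if the
# deployment is truly idempotent (exit code 0 = no drift, 2 = changes pending)
terraform plan -detailed-exitcode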

🔧 Iterative Refinement in Practice

A typical troubleshooting session went something like this:

  • I’d run the code or attempt a terraform plan or apply.
  • If there were no errors, I’d verify the outcome and move on.
  • If there were errors, I’d copy the output into Claude and we’d go back and forth trying to fix the problem.

This is where things often got tricky. Claude would sometimes suggest hardcoded values despite earlier instructions to avoid them, or propose overly complex fixes instead of the simple, obvious one. Even with clear guidance in the prompt, it was a constant effort to keep the AI focused and within scope.

The process wasn’t just code generation — it was troubleshooting, adjusting, and rechecking until things finally worked as expected.

Terraform schema correction

The process revealed both the strengths and limitations of AI-assisted Infrastructure as Code development.


🧠 Part 3: Working with GenAI – The Good, the Bad, and the Wandering

Building two versions of the same project entirely through AI conversations provided unique insights into the practical realities of AI-assisted development.

This wasn’t the utopian "AI will do everything" fantasy — nor was it the cynical "AI can’t do anything useful" view.
It was somewhere in between: messy, human, instructive.


✅ The Good: Where AI Excelled

⚡ Rapid prototyping and iteration
Claude could produce working infrastructure code faster than I could even open the Azure documentation.
Need a Container App with specific environment variables? ✅ Done.
Modify the OpenAI integration logic? ✅ Updated in seconds.

🧩 Pattern recognition and consistency
Once Claude grasped the structure of the project, it stuck with it.
Variable names, tagging conventions, module layout — it stayed consistent without me needing to babysit every decision.

🛠️ Boilerplate generation
Claude churned out huge volumes of code across Terraform, PowerShell, React, and Python — all syntactically correct and logically structured, freeing me from repetitive coding.


❌ The Bad: Where AI Struggled

🧠 Context drift and prompt guardrails
Even with structured, detailed instructions, Claude would sometimes go rogue:

  • Proposing solutions for problems I hadn’t asked about
  • Rewriting things that didn’t need fixing
  • Suggesting complete redesigns for simple tweaks

🎉 Over-enthusiasm
Claude would often blurt out things like:

“CONGRATULATIONS!! 🎉 You now have a production-ready AI Monitoring platform!”
To which I’d reply:
“Er, no bro. We're nowhere near done here. Still Cuz.”

(Okay, I don’t really talk to Claude like a GenZ wannabe Roadman — but you get the idea 😂)

🐛 Runtime debugging limitations
Claude could write the code. But fixing things like:

  • Azure authentication issues
  • Misconfigured private endpoints
  • Resource naming collisions
    …was always on me. These weren’t things AI could reliably troubleshoot on its own.

🔁 Project continuity fail
There’s no persistent memory.
Every new session meant reloading context from scratch — usually by copy-pasting yesterday’s chat into a new one.
Tedious, error-prone, and inefficient.


🌀 The Wandering: Managing AI Attention

⚠️ Fundamental challenge: No memory
Claude has no memory beyond the current chat. Even structured prompts didn't prevent "chat drift" unless I constantly reinforced boundaries. This is where ChatGPT has an edge, in my opinion: if prompted, it can pull up examples and context from our previous chats.

🎯 The specificity requirement
Vague:

"Fix the container deployment"
Resulted in:
"Let’s rebuild the entire architecture from scratch" 😬

Precise:

"Update the environment variable REACT_APP_API_URL in container-apps module"
Got the job done.

🚫 The hardcoded value trap
Claude loved quick fixes — often hardcoding values just to “make it work”.
I had to go back and de-hardcode everything to stay true to the DRY principles I set from day one.
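A crude but effective way to catch those quick fixes before they settle in is to scan the module tree for suspiciously literal values. The patterns below are examples drawn from this project, not an exhaustive list:

# Scan modules for literal values that should be variables or locals (example patterns)
grep -rEn --include='*.tf' 'uksouth|rg-ai-monitoring|ask6868' modules/

# Subscription IDs and other GUIDs are common offenders
grep -rEn --include='*.tf' '[0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}' modules/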

⏳ Time impact for non-devs
Both stages of the project took longer than they probably should have — not because of any one flaw, but because of the nature of working with AI-generated infrastructure code.

A seasoned DevOps engineer might have moved faster by spotting bugs earlier and validating logic more confidently. But a pure developer? Probably not. They’d likely struggle with the Azure-specific infrastructure decisions, access policies, and platform configuration that were second nature to me.

This kind of work sits in a grey area — it needs both engineering fluency and platform experience. The real takeaway? GenAI can bridge that gap in either direction, but whichever way you’re coming from, there’s a learning curve.

The cost: higher validation effort.
The reward: greater independence and accelerated learning.


🏗️ Part 4: Building The Stack - What Got Built

The final Terraform solution creates a fully integrated AI monitoring ecosystem in Azure — one that’s modular, intelligent, and almost production-ready.
Here’s what was actually built — and why.


🔧 Core Architecture

🧠 Azure OpenAI Integration
At the heart of the system is GPT-4o-mini, providing infrastructure analysis and recommendations at a significantly lower cost than GPT-4 — without compromising on quality for this use case.

📦 Container Apps Environment
Four lightweight, purpose-driven containers manage the monitoring workflow:

  • ⚙️ FastAPI backend – Data ingestion and processing
  • 📊 React dashboard – Front-end UI and live telemetry
  • 🔄 Background processor – Continuously monitors resource health
  • 🚀 Load generator – Simulates traffic for stress testing and metrics

⚡ Azure Function Apps for AI Processing
Serverless compute bridges raw telemetry with OpenAI for analysis.
Functions scale on demand, keeping costs low and architecture lean.

⚠️ The only part of the project not handled in Terraform was the custom dashboard container build. That's by design — Terraform isn’t meant for image building or pushing. Instead, I handled that manually (or via a CI pipeline), which aligns with HashiCorp's best practices.
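For reference, that out-of-band image step looked roughly like the following. The registry name is a placeholder, and the exact commands will vary with your CI setup:

# Build and push the dashboard image to ACR outside of Terraform
# (registry name is a placeholder; run from the dashboard source directory)
az acr build --registry myregistry --image dashboard:latest .

# Or build locally and push
docker build -t myregistry.azurecr.io/dashboard:latest .
az acr login --name myregistry
docker push myregistry.azurecr.io/dashboard:latest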


🧰 Supporting Infrastructure

  • Application Insights – Real-time telemetry for diagnostics
  • Log Analytics – Centralised logging and query aggregation
  • Azure Container Registry (ACR) – Stores and serves custom container images
  • Key Vault – Secrets management for safe credential handling

🤔 Key Technical Decisions

🆚 Why Container Apps instead of AKS?
Honestly? Claude made the call.
When I described what I needed (multi-container orchestration without complex ops), Claude recommended Container Apps over AKS, citing:

  • Lower cost
  • Simpler deployment
  • Sufficient capability for this workload

And… Claude was right. AKS would have been overkill.

💸 Why GPT-4o-mini over GPT-4?
This was a no-brainer. GPT-4o-mini gave near-identical results for our monitoring analysis — at a fraction of the cost.
Perfect balance of performance and budget.

📦 Why modular Terraform over monolithic deployment?
Because chaos is not a deployment strategy.
Modular code = clean boundaries, reusable components, and simple environment customization.
It’s easier to debug, update, and scale.


🧮 Visual Reference

Below are visuals captured during project development and testing:

🔹 VS Code project structure
VS Code project structure

🔹 Claude Projects interface
Claude Projects interface


📊 What the Dashboard Shows

The final React-based dashboard delivers:

  • Real-time API health checks
  • 🧠 AI-generated infrastructure insights
  • 📈 Performance metrics + trend analysis
  • 💬 Interactive chat with OpenAI
  • 📤 Exportable chats for analysis

🔹 Dashboard – Full view
Dashboard Full View

🔹 AI analysis in progress
Dashboard AI analysis 2

🔹 OpenAI response card
OpenAI response


🧾 Part 5: The Result - A Portable, Reusable AI Monitoring Stack

The final Terraform deployment delivers a complete, modular, and production-friendly AI monitoring solution — fully reproducible across environments. More importantly, it demonstrates that AI-assisted infrastructure creation is not just viable, but effective when paired with good practices and human oversight.


🚀 Deployment Experience

From zero to running dashboard:
~ 15 minutes (give or take 30-40 hours 😂)

terraform init
terraform plan
terraform apply

Minimal configuration required:

  • ✅ Azure subscription credentials
  • 📄 Terraform variables (project name, region, container image names, etc.)
  • 🐳 Container image references (can use defaults or custom builds)
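With those inputs in place, a fresh deployment is little more than the following. The subscription ID is a placeholder, and terraform.tfvars holds the project-specific values:

# Authenticate and select the target subscription (placeholder ID)
az login
az account set --subscription "<subscription-id>"

# Deploy the full stack from the repository root
terraform init
terraform plan -var-file="terraform.tfvars" -out=tfplan
terraform apply tfplan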

🗺️ Infrastructure Overview

The final deployment provisions a complete, AI-driven monitoring stack — built entirely with Infrastructure as Code and connected through modular Terraform components.

🔹 Azure Resource Visualizer
Azure Resource Visualizer


💰 Cost Optimization

This solution costs ~£15 per month for a dev/test deployment (even cheaper if you remember to turn the container apps off!😲) — vastly cheaper than typical enterprise-grade monitoring tools (which can range £50–£200+ per month).

Key savings come from:

  • ⚡ Serverless Functions instead of always-on compute
  • 📦 Container Apps that scale to zero during idle time
  • 🤖 GPT-4o-mini instead of GPT-4 (with negligible accuracy trade-off)
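The scale-to-zero behaviour above is a Container Apps setting. As a rough sketch of the kind of configuration involved (app and resource group names reuse earlier examples; replica counts are illustrative):

# Allow the dashboard app to scale to zero while idle (illustrative values)
az containerapp update \
  --name react-dashboard \
  --resource-group rg-ai-monitoring-dev \
  --min-replicas 0 \
  --max-replicas 2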

🔁 Portability Validation

The real benefit of this solution is in its repeatability:

Dev environment
UK South, full-feature stack

Test deployment
New resource group, same subscription — identical results

Clean subscription test
Fresh environment, zero config drift

Conclusion:
No matter where or how it's deployed, the stack just works.


🧠 Part 6: Reflections and Lessons Learned

Building the same solution twice — once manually, once using Infrastructure as Code — offered a unique lens through which to view both AI-assisted development and modern infrastructure practices.


🤖 On AI-Assisted Development

🔎 The reality check
AI-assisted development is powerful but not autonomous. It still relies on:

  • Human oversight
  • Strategic decisions
  • Recognizing when the AI is confidently wrong

⚡ Speed vs. Quality
AI can produce working code fast — sometimes scarily fast — but:

  • The validation/debugging can take longer than traditional coding
  • The real power lies in architectural iteration, not production-readiness

📚 The learning curve
Truthfully, both v1 and v2 took much longer than they should have.
A seasoned developer with better validation skills could likely complete either project in half the time — by catching subtle issues earlier.


🛠️ On Infrastructure as Code

📐 The transformation
Switching to Terraform wasn’t just about reusability:

  • It encouraged cleaner design, logical resource grouping, and explicit dependencies
  • It forced better decisions

🧩 The hidden complexity
What looked simple in Terraform:

  • Revealed just how messy the manual deployment had been
  • Every implicit assumption, naming decision, and “just click here” moment had to become codified and reproducible

🎭 On Vibe Coding as a Methodology

✅ What worked:

  • Rapid architectural exploration
  • Solving problems in plain English
  • Iterative builds based on feedback
  • AI-assisted speed gains (things built in hours, not days)

❌ What didn’t:

  • Continuity across chat sessions
  • Preserving project context
  • Runtime debugging in Azure
  • Keeping the agent focused on scoped tasks

🔁 Things I’d Do Differently

🧾 Better structured prompting from the outset
While I used a defined structure for the AI prompt, I learned:

  • Even good prompts require ongoing reinforcement
  • Claude needed regular reminders to stay on track during long sessions

✅ Regular resource validation
A recurring challenge:

  • Claude often over-provisioned services
  • Periodic reviews of what we were building helped cut waste and simplify architecture

🧠 The reality of AI memory limitations
No, the AI does not “remember” anything meaningful between sessions:

  • Every day required rebuilding the conversation context
  • Guardrails had to be restated often

🎯 The extreme specificity requirement
Vague asks = vague solutions
But:

  • Precise requests like “update REACT_APP_API_URL in container-apps module” yielded laser-targeted results

✅ Conclusion

This project started as a career thought experiment — “What if there was a role focused on AI-integrated infrastructure?” — and ended with a fully functional AI monitoring solution deployed in Azure.

What began as a prototype on a local Pi5 evolved into a robust, modular Terraform deployment. Over 4–5 weeks, it generated thousands of lines of infrastructure code, countless iterations, and a treasure trove of insights into AI-assisted development.


🚀 The Technical Outcome

The result is a portable, cost-effective, AI-powered monitoring system that doesn’t just work — it proves a point. It's not quite enterprise-ready, but it’s a solid proof-of-concept and a foundation for learning, experimentation, and future iteration.


🧠 Key Takeaways

  1. AI-assisted development is powerful — but not autonomous.
    It requires constant direction, critical oversight, and the ability to spot when the AI is confidently wrong.

  2. Infrastructure as Code changes how you architect.
    Writing Terraform forces discipline: clean structure, explicit dependencies, and reproducible builds.

  3. Vibe coding has a learning curve.
    Both versions took longer than expected. A seasoned developer could likely move faster — but for infra pros, this is how we learn.

  4. Context management is still a major limitation.
    The inability to persist AI session memory made long-term projects harder than they should have been.

  5. The role of “Infrastructure AI Integration Engineer” is real — and emerging.
    This project sketches out what that future job might involve: blending IaC, AI, automation, and architecture.


🧭 What’s Next?

Version 3 is already brewing ☕ — ideas include:

  • Monitoring more Azure services
  • Improving the dashboard’s AI output formatting
  • Experimenting with newer tools like Claude Code and ChatGPT Codex
  • Trying AI-native IDEs and inline assistants to streamline the workflow

And let’s not forget the rise of “Slop-Ops” — that beautiful mess where AI, infrastructure, and vibe-based engineering collide 😎


💡 Final Thoughts

If you're an infrastructure engineer looking to explore AI integration, here’s the reality:

  • The tools are ready.
  • The method works.
  • But it’s not magic — it takes effort, patience, and curiosity.

The future of infrastructure might be conversational — but it’s not (yet) automatic.


If you’ve read this far — thanks. 🙏 I’d love feedback from anyone experimenting with AI-assisted IaC or Terraform refactors. Find me on [LinkedIn] or leave a comment.


🍓 Building AI-Powered Infrastructure Monitoring: From Home Lab to Cloud Production

After successfully diving into AI automation with n8n (and surviving the OAuth battles), I decided to tackle a more ambitious learning project: exploring how to integrate AI into infrastructure monitoring systems. The goal was to understand how AI can transform traditional monitoring from simple threshold alerts into intelligent analysis that provides actionable insights—all while experimenting in a safe home lab environment before applying these concepts to production cloud infrastructure.

What you'll discover in this post:

  • Complete monitoring stack deployment using Docker Compose
  • Prometheus and Grafana setup for metrics collection
  • n8n workflow automation for data processing and AI analysis
  • Azure OpenAI integration for intelligent infrastructure insights
  • Professional email reporting with HTML templates
  • Lessons learned for transitioning to production cloud environments
  • Practical skills for integrating AI into traditional monitoring workflows

Here's how I built a home lab monitoring system to explore AI integration patterns that can be applied to production cloud infrastructure.

Full disclosure: I'm using a Visual Studio Enterprise subscription which provides £120 monthly Azure credits. This makes Azure OpenAI experimentation cost-effective for learning purposes. I found direct OpenAI API connections too expensive for extensive experimentation.


🎯 Prerequisites & Planning

Before diving into the implementation, let's establish what you'll need and the realistic time investment required for this learning project.

Realistic Learning Prerequisites

Essential Background Knowledge:

Docker & Containerization:

  • Can deploy multi-container applications with Docker Compose
  • Understand container networking and volume management
  • Can debug why containers can't communicate with each other
  • Familiar with basic Docker commands (logs, exec, inspect)
  • Learning Resource: Docker Official Tutorial - Comprehensive introduction to containerization

API Integration:

  • Comfortable making HTTP requests with authentication headers
  • Can read and debug JSON responses
  • Understand REST API concepts and error handling
  • Experience with tools like curl or Postman for API testing
  • Learning Resource: REST API Tutorial - Complete guide to RESTful services

Infrastructure Monitoring Concepts:

  • Know what CPU, memory, and disk metrics actually represent
  • Understand the difference between metrics, logs, and traces
  • Familiar with the concept of time-series data
  • Basic understanding of what constitutes "normal" vs "problematic" system behavior
  • Learning Resource: Prometheus Documentation - Monitoring fundamentals and concepts

Skills You'll Develop During This Project:

  • AI prompt engineering for infrastructure analysis
  • Workflow automation with complex orchestration
  • Integration of traditional monitoring with modern AI services
  • Business communication of technical metrics
  • Cost-conscious AI service usage and optimization


Honest Time Investment Expectations

If you have all prerequisites: 1-2 weeks for complete implementation

  • Basic setup: 2-3 hours
  • AI integration: 4-6 hours
  • Customization and optimization: 6-8 hours
  • Cloud transition planning: 4-6 hours

If missing Docker skills: Add 2-3 weeks for learning fundamentals

If new to monitoring: Add 1-2 weeks for infrastructure concepts

If unfamiliar with APIs: Add 1 week for HTTP/JSON basics

  • REST API fundamentals: 3-4 days
  • JSON manipulation practice: 2-3 days
  • Authentication concepts: 1-2 days
  • Recommended Learning: HTTP/REST API Tutorial and hands-on practice with JSONPlaceholder

Hardware & Service Requirements

Minimum Configuration:

  • Hardware: 8GB+ RAM (Pi 5 8GB or standard x86 machine)
  • Storage: 100GB+ available space (containers + metrics retention)
  • Network: Stable internet connection with static IP preferred
  • Software: Docker 20.10+, Docker Compose 2.0+

Service Account Setup for Learning:

  • Azure OpenAI: Azure subscription with OpenAI access (Visual Studio Enterprise subscription provides excellent experimentation credits)
  • Email Provider: Gmail App Password works perfectly for testing
  • Cloud Account: AWS/Azure free tier for eventual cloud transition
  • Monitoring Tools: All open-source options used in this project

Learning Environment Costs:

  • Azure OpenAI: Covered by Visual Studio Enterprise subscription credits
  • Infrastructure: Minimal electricity costs for Pi 5 operation
  • Email: £0 (using Gmail App Password)
  • Total out-of-pocket: Essentially £0 for extensive experimentation

Note: Without subscription benefits, AI analysis costs should be carefully monitored as they can accumulate with frequent polling.

My Learning Setup

🖥️ Home Lab Environment

  • Primary system: Raspberry Pi 5, 8GB RAM (24/7 learning host)
  • Development approach: Iterative experimentation with immediate feedback
  • Network setup: Standard home lab environment
  • Learning support: Anthropic Claude for debugging and optimization

🏗️ Phase 1: Foundation - The Monitoring Stack

Building any intelligent monitoring system starts with having something intelligent to monitor. Enter the classic Prometheus + Grafana combo, containerized for easy deployment and scalability.

The foundation phase establishes reliable metric collection before adding intelligence layers. This approach ensures we have clean, consistent data to feed into AI analysis rather than trying to retrofit intelligence into poorly designed monitoring systems.

✅ Learning Checkpoint: Before You Begin

Before starting this project, verify you can:

  • [ ] Deploy a multi-container application with Docker Compose
  • [ ] Debug why a container can't reach another container
  • [ ] Make API calls with authentication headers using curl
  • [ ] Read and understand JSON data structures
  • [ ] Explain what CPU and memory metrics actually mean for system health

Quick Test: Can you deploy a simple web application stack (nginx + database) using Docker Compose and troubleshoot networking issues? If not, spend time with Docker fundamentals first.

Common Issue at This Stage: Container networking problems are the most frequent stumbling block. If containers can't communicate, review Docker Compose networking documentation and practice with simple multi-container applications before proceeding.
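When containers can't talk to each other, a handful of commands usually finds the cause. This is a rough checklist with example service and network names (yours will differ):

# Rough container-networking debug checklist (service/network names are examples)
docker-compose ps                         # are all services actually running?
docker network ls                         # did Compose create the expected network?
docker network inspect <project>_default  # which containers joined it, and with what IPs?

# Can one service resolve another by its Compose service name?
# (use getent hosts, ping, or wget, depending on what the image ships)
docker-compose exec web getent hosts db

# What does the failing service say?
docker-compose logs --tail=50 db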

🐳 Docker Compose Infrastructure

The entire monitoring stack deploys through a single Docker Compose file. This approach ensures consistent environments from home lab development through cloud production.

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - prometheus_data:/prometheus
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=7d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana-oss:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=aimonitoring123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./config/grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  n8n:
    image: n8nio/n8n:latest
    container_name: n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=admin
      - N8N_BASIC_AUTH_PASSWORD=aimonitoring123
      - N8N_HOST=0.0.0.0
      - N8N_PORT=5678
      - N8N_PROTOCOL=http
    volumes:
      - n8n_data:/home/node/.n8n
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  n8n_data:

networks:
  monitoring:
    driver: bridge

⚙️ Prometheus Configuration

The heart of metric collection needs careful configuration to balance comprehensive monitoring with resource efficiency:

# config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 5s
    metrics_path: /metrics

  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']
    scrape_interval: 30s

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

🚀 Deployment and Initial Setup

Launch your monitoring foundation with a single command:

# Create project structure
mkdir ai-monitoring-lab && cd ai-monitoring-lab
mkdir -p config/grafana/provisioning/{datasources,dashboards}
mkdir -p data logs

# Deploy the stack
docker-compose up -d

# Verify deployment
docker-compose ps
docker-compose logs -f prometheus

Home lab Grafana dashboard showing real system metrics with proper visualization and alerting thresholds

Within minutes, you'll have comprehensive system metrics flowing through Prometheus and visualized in Grafana. But pretty graphs are just the beginning—the real transformation happens when we add AI analysis.

✅ Learning Checkpoint: Monitoring Foundation

Before proceeding to workflow automation, verify you can:

  • [ ] Access Prometheus at localhost:9090 and see targets as "UP"
  • [ ] View system metrics in Grafana dashboards
  • [ ] Write basic PromQL queries (like node_memory_MemAvailable_bytes)
  • [ ] Understand what the metrics represent in business terms
  • [ ] Create a custom Grafana panel showing memory usage as a percentage

Quick Test: Create a dashboard panel that shows "Memory utilization is healthy/concerning" based on percentage thresholds. If you can't do this easily, spend more time with Prometheus queries and Grafana visualization.
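If you want to sanity-check the numbers outside Grafana first, the Prometheus HTTP API makes that easy. This sketch uses the same memory expression the n8n workflow relies on later; the jq step is optional:

# Query memory utilisation (%) directly from the Prometheus HTTP API
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) * 100'

# Optional: extract just the numeric value from the JSON response (requires jq)
#   ... | jq -r '.data.result[0].value[1]'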

Common Issues at This Stage:

  • Prometheus targets showing as "DOWN" - Usually container networking or firewall issues
  • Grafana showing "No data" - Often datasource URL configuration problems
  • PromQL query errors - Syntax issues with metric names or functions

Prometheus targets page demonstrating successful metric collection from all configured endpoints


🔗 Bridging to Intelligence: Why Traditional Monitoring Isn't Enough

Traditional monitoring tells you what happened (CPU is at 85%) but not why it matters or what you should do about it. Most alerts are just noise without context about whether that 85% CPU usage is normal for your workload or a sign of impending system failure.

This is where workflow automation and AI analysis bridge the gap between raw metrics and actionable insights.

What n8n brings to the solution:

  • Orchestrates data collection from multiple sources beyond just Prometheus
  • Transforms raw metrics into structured data suitable for AI analysis
  • Handles error scenarios and fallbacks gracefully without custom application development
  • Enables complex logic through visual workflows rather than scripting
  • Provides integration capabilities with email, chat systems, and other tools

Why AI analysis matters:

  • Adds context: "85% CPU usage is normal for this workload during business hours"
  • Predicts trends: "Memory usage trending upward, recommend capacity review in 2 weeks"
  • Communicates impact: "System operating efficiently with no immediate business impact"
  • Reduces noise: Only alert on situations that actually require attention

The combination creates a monitoring system that doesn't just detect problems—it explains them in business terms and recommends specific actions.


🤖 Phase 2: n8n Workflow Automation - The Intelligence Orchestrator

n8n transforms our basic monitoring stack into an intelligent analysis system. Through visual workflow design, we can create complex logic without writing extensive custom code.

Complete n8n workflow canvas displaying both alert and scheduled reporting paths with visible node connections

⏰ Data Collection: The Foundation Nodes

The workflow begins with intelligent data collection that fetches exactly the metrics needed for AI analysis:

// Schedule Trigger Node Configuration
{
  "rule": {
    "interval": [
      {
        "field": "cronExpression",
        "value": "0 */1 * * *"  // Every hour
      }
    ]
  }
}

Prometheus Query Node (HTTP Request):

{
  "url": "http://prometheus:9090/api/v1/query",
  "method": "GET",
  "qs": {
    "query": "((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) * 100"
  },
  "options": {
    "timeout": 10000
  }
}

🔄 Process Metrics: The Data Transformation

The magic happens in the data processing node, which transforms Prometheus's JSON responses into clean, AI-friendly data structures. This step is crucial—AI analysis is only as good as the data you feed it.

// Process Metrics Node - JavaScript Code
const input = $input.first();

try {
  // Extract metric values from Prometheus response
  const memoryData = input.json.data.result[0];
  const memoryPercent = parseFloat(memoryData.value[1]).toFixed(2);

  // Determine system health status
  let alertLevel, alertStatus, systemHealth;

  if (memoryPercent < 60) {
    alertLevel = 'LOW';
    alertStatus = 'HEALTHY';
    systemHealth = 'optimal';
  } else if (memoryPercent < 80) {
    alertLevel = 'MEDIUM';
    alertStatus = 'WATCH';
    systemHealth = 'elevated but manageable';
  } else {
    alertLevel = 'HIGH';
    alertStatus = 'CRITICAL';
    systemHealth = 'requires immediate attention';
  }

  // Structure data for AI analysis
  const processedData = {
    timestamp: new Date().toISOString(),
    memory_percent: parseFloat(memoryPercent),
    alert_level: alertLevel,
    alert_status: alertStatus,
    system_health: systemHealth,
    collection_source: 'prometheus',
    analysis_ready: true
  };

  return { json: processedData };

} catch (error) {
  console.error('Metrics processing failed:', error);
  return { 
    json: { 
      error: true, 
      message: 'Unable to process metrics data',
      timestamp: new Date().toISOString()
    } 
  };
}

✅ Learning Checkpoint: n8n Workflow Fundamentals

Before adding AI analysis, verify you can:

  • [ ] Create a basic n8n workflow that fetches Prometheus data
  • [ ] Process the JSON response and extract specific metrics
  • [ ] Send a test email with the processed data
  • [ ] Handle basic error scenarios (API timeout, malformed response)
  • [ ] Understand the data flow from Prometheus → n8n → Email

Quick Test: Build a simple workflow that emails you the current memory percentage every hour. If this seems challenging, spend more time understanding n8n's HTTP request and JavaScript processing nodes.

Common Issues at This Stage:

  • n8n workflow execution failures - Usually authentication or API endpoint problems
  • JavaScript node errors - Often due to missing error handling or incorrect data parsing
  • Email delivery failures - SMTP configuration or authentication issues

Debugging Tip: Use console.log() extensively in JavaScript nodes and check the execution logs for detailed error information.


🧠 Phase 3: Azure OpenAI Integration - Adding Intelligence

This is where the system evolves from "automated alerting" to "intelligent analysis." Azure OpenAI takes our clean metrics and transforms them into actionable insights that even non-technical stakeholders can understand and act upon.

The transition from raw monitoring data to business intelligence happens here—transforming "Memory usage is 76%" into "The system is operating efficiently with healthy resource utilization, indicating well-balanced workloads with adequate capacity for current business requirements."

Why this transformation matters:

  • Technical teams get context about whether metrics indicate real problems
  • Business stakeholders understand impact without needing to interpret technical details
  • Decision makers receive actionable recommendations rather than just status updates

🎯 AI Analysis Configuration

The AI analysis node sends structured data to Azure OpenAI with carefully crafted prompts:

// Azure OpenAI Analysis Node - HTTP Request Configuration
{
  "url": "https://YOUR_RESOURCE.openai.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-02-15-preview",
  "method": "POST",
  "headers": {
    "Content-Type": "application/json",
    "api-key": "YOUR_AZURE_OPENAI_API_KEY"
  },
  "body": {
    "messages": [
      {
        "role": "system",
        "content": "You are an infrastructure monitoring AI assistant. Analyze system metrics and provide clear, actionable insights for both technical teams and business stakeholders. Focus on business impact, recommendations, and next steps."
      },
      {
        "role": "user", 
        "content": "Analyze this system data: Memory usage: {{$json.memory_percent}}%, Status: {{$json.alert_status}}, Health: {{$json.system_health}}. Provide business context, technical assessment, and specific recommendations."
      }
    ],
    "max_tokens": 500,
    "temperature": 0.3
  }
}

📊 Report Generation and Formatting

The AI response gets structured into professional reports suitable for email distribution:

// Create Report Node - JavaScript Code
const input = $input.first();

try {
  // Handle potential API errors
  if (!input.json.choices || input.json.choices.length === 0) {
    throw new Error('No AI response received');
  }

  // Extract AI analysis from Azure OpenAI response
  const aiAnalysis = input.json.choices[0].message.content;
  const metricsData = $('Process Metrics').item.json;

  // Calculate token usage for monitoring
  const tokenUsage = input.json.usage ? input.json.usage.total_tokens : 0;

  const report = {
    report_id: `AI-MONITOR-${new Date().toISOString().slice(0,10)}-${Date.now()}`,
    generated_at: new Date().toISOString(),
    ai_insights: aiAnalysis,
    system_metrics: {
      memory_usage: `${metricsData.memory_percent}%`,
      cpu_usage: `${metricsData.cpu_percent}%`, // assumes Process Metrics also exposes cpu_percent; remove if only memory is collected
      alert_status: metricsData.alert_status,
      system_health: metricsData.system_health
    },
    usage_tracking: {
      tokens_used: tokenUsage,
      model_used: input.json.model || 'gpt-4o-mini'
    },
    metadata: {
      next_check: new Date(Date.now() + 5*60*1000).toISOString(),
      report_type: metricsData.alert_level === 'LOW' ? 'routine' : 'alert',
      confidence_score: 0.95 // Based on data quality
    }
  };

  return { json: report };

} catch (error) {
  // Fallback report without AI analysis
  const metricsData = $('Process Metrics').item.json;

  return { 
    json: { 
      report_id: `AI-MONITOR-ERROR-${Date.now()}`,
      generated_at: new Date().toISOString(),
      ai_insights: `System analysis unavailable due to AI service error. Raw metrics: Memory ${metricsData.memory_percent}%, CPU ${metricsData.cpu_percent}%. Status: ${metricsData.alert_status}`,
      error: true,
      error_message: error.message
    } 
  };
}

This transforms the AI response into a structured report with tracking information, token usage monitoring, and timestamps—everything needed for understanding resource utilization and system performance.

✅ Learning Checkpoint: AI Integration

Before moving to production thinking, verify you can:

  • [ ] Successfully call Azure OpenAI API with authentication
  • [ ] Create prompts that generate useful infrastructure analysis
  • [ ] Handle API errors and implement fallback behavior
  • [ ] Monitor token usage to understand resource consumption
  • [ ] Generate reports that are readable by non-technical stakeholders

Quick Test: Can you send sample metrics to Azure OpenAI and get back analysis that your manager could understand and act upon? If the analysis feels generic or unhelpful, focus on prompt engineering improvement.

Common Issues at This Stage:

  • Azure OpenAI authentication failures - API key or endpoint URL problems
  • Rate limiting errors (HTTP 429) - Too frequent API calls or quota exceeded
  • Generic AI responses - Prompts lack specificity or context
  • Token usage escalation - Inefficient prompts or too frequent analysis

Azure portal cost analysis showing actual token usage, costs, and optimization opportunities from home lab experimentation


📧 Phase 4: Professional Email Reporting

The final component transforms AI insights into professional stakeholder communications that drive business decisions.

🎨 HTML Email Template Design

<!-- Email Template Node - HTML Content -->
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Infrastructure Intelligence Report</title>
<style>
body { font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; margin: 0; padding: 20px; background-color: #f5f5f5; }
.container { max-width: 800px; margin: 0 auto; background-color: white; border-radius: 8px; box-shadow: 0 2px 10px rgba(0,0,0,0.1); }
.header { background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 30px; border-radius: 8px 8px 0 0; }
.content { padding: 30px; }
.metric-card { background-color: #f8f9fa; border-left: 4px solid #007bff; padding: 15px; margin: 15px 0; border-radius: 4px; }
.ai-analysis { background-color: #e8f4fd; border: 1px solid #bee5eb; padding: 20px; border-radius: 6px; margin: 20px 0; }
.footer { background-color: #f8f9fa; padding: 20px; text-align: center; border-radius: 0 0 8px 8px; font-size: 12px; color: #6c757d; }
</style>
</head>
<body>
<div class="container">
<div class="header">
<h1>🤖 AI Infrastructure Intelligence Report</h1>
<p>Automated analysis and recommendations • {{ $json.generated_at }}</p>
</div>

<div class="content">
<h2>🧠 AI-Powered Analysis</h2>
<div class="ai-analysis">
<strong>System Intelligence Summary:</strong><br>
{{ $json.ai_insights }}
</div>

<h2>📊 Current System Metrics</h2>
<div class="metric-card">
<strong>Memory Utilization:</strong> {{ $json.system_metrics.memory_usage }}<br>
<strong>System Status:</strong> {{ $json.system_metrics.alert_status }}<br>
<strong>Health Assessment:</strong> {{ $json.system_metrics.system_health }}
</div>

<h3>💰 Token Usage Analysis</h3>
<div style="background-color: #d1ecf1; padding: 10px; border-radius: 5px;">
<ul>
<li><strong>Tokens Used This Report:</strong> {{ $json.usage_tracking.tokens_used }}</li>
<li><strong>AI Model:</strong> {{ $json.usage_tracking.model_used }}</li>
<li><strong>Analysis Frequency:</strong> Configurable based on monitoring requirements</li>
<li><strong>Note:</strong> Token usage varies based on metric complexity and prompt length</li>
</ul>
</div>

<h3>⏰ Report Metadata</h3>
<ul>
<li><strong>Report ID:</strong> {{ $json.report_id }}</li>
<li><strong>Generated:</strong> {{ $json.generated_at }}</li>
<li><strong>Next Check:</strong> {{ $json.metadata.next_check }}</li>
<li><strong>Report Type:</strong> {{ $json.metadata.report_type }}</li>
</ul>
</div>

<div class="footer">
<p>Generated by AI-Powered Infrastructure Monitoring System<br>
Home Lab Implementation • Learning Project for Cloud Production Application</p>
</div>
</div>
</body>
</html>

📮 Email Delivery Configuration

// Email Delivery Node - SMTP Configuration
{
  "host": "smtp.gmail.com",
  "port": 587,
  "secure": false,
  "auth": {
    "user": "your-email@gmail.com",
    "pass": "your-app-password"
  },
  "from": "AI Infrastructure Monitor <your-email@gmail.com>",
  "to": "stakeholders@company.com",
  "subject": "🤖 Infrastructure Intelligence Report - {{ $json.system_metrics.alert_status }}",
  "html": "{{ $('HTML Template').item.json.html_content }}"
}

The difference between this and traditional monitoring emails is remarkable—instead of "CPU is at 85%," stakeholders get "The system is operating within optimal parameters with excellent resource efficiency, suggesting current workloads are well-balanced and no immediate action is required."

Professional HTML email showing AI analysis, formatted metrics, and business impact summary


🎯 My Learning Journey: What Actually Happened

Understanding the real progression of this project helps set realistic expectations for your own learning experience.

Week 1: Foundation Building
I started by getting the basic monitoring stack working. The Docker Compose approach made deployment straightforward, but understanding why each component was needed took time. I spent several days just exploring Prometheus queries and Grafana dashboards—this foundational understanding proved essential for later AI integration.

Week 2: Workflow Automation Discovery
Adding n8n was where things got interesting. The visual workflow builder made complex logic manageable, but I quickly learned that proper error handling isn't optional—it's essential. Using Anthropic Claude to debug JavaScript issues in workflows saved hours of frustration and accelerated my learning significantly.

Week 3: AI Integration Breakthrough
This is where the real magic happened. Seeing raw metrics transformed into business-relevant insights was genuinely exciting. The key insight: prompt engineering for infrastructure is fundamentally different from general AI use—specificity about your environment and context matters enormously.

Week 4: Production Thinking
The final week focused on understanding how these patterns would apply to real cloud infrastructure. This home lab approach meant I could experiment safely and make mistakes without impact, while building knowledge directly applicable to production environments.


📊 Home Lab Performance Observations

After running the system continuously in my home lab environment:

System Reliability:

  • Uptime: 99.2% (brief restarts for updates and one power outage)
  • Data collection reliability: 99.8% (missed 3 collection cycles due to network issues)
  • AI analysis success rate: 97.1% (some Azure throttling during peak hours)
  • Email delivery: 100% (SMTP proved reliable for testing purposes)

Resource Utilization on Pi 5:

  • Memory usage: 68% peak, 45% average (acceptable for home lab testing)
  • CPU usage: 15% peak, 8% average (monitoring has minimal impact)
  • Storage growth: 120MB/week (Prometheus data with 7-day retention and compression)

Response Times in Home Lab:

  • Metric collection: 2.3 seconds average
  • AI analysis response: 8.7 seconds average (Azure OpenAI)
  • End-to-end report generation: 12.4 seconds
  • Email delivery: 3.1 seconds average

Token Usage Observations:

  • Average tokens per analysis: ~507 tokens
  • Analysis frequency: Hourly during active testing
  • Model efficiency: GPT-4o-mini provided excellent analysis quality for infrastructure metrics
  • Optimization: Prompt refinement reduced token usage by ~20% over time

Note: These are home lab observations for learning purposes. Production cloud deployments would have different performance characteristics and scaling requirements.


⚠️ Challenges and Learning Points

Prometheus query optimization is critical: Inefficient queries can overwhelm the Pi 5, especially during high-cardinality metric collection. Always validate queries against realistic datasets and implement appropriate rate limiting. Complex aggregation queries should be pre-computed where possible.

n8n workflow complexity escalates quickly: What starts as simple data collection becomes complex orchestration with error handling, retries, and fallbacks. Start simple and add features incrementally. I found that using Anthropic Claude to help debug workflow issues significantly accelerated problem resolution.

AI prompt engineering requires iteration: Generic prompts produce generic insights that add little value over traditional alerting. Tailoring prompts for specific infrastructure contexts, stakeholder audiences, and business objectives dramatically improves output quality and relevance.

Network reliability affects everything: Since the system depends on multiple external APIs (Azure OpenAI, SMTP), network connectivity issues cascade through the entire workflow. Implementing proper timeout handling and offline modes is essential for production reliability.
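A minimal sketch of that timeout-plus-fallback pattern, written as plain Node-style JavaScript (Node 18+ for the built-in fetch) rather than my exact n8n node configuration; the endpoint, header name, and stub response mirror the Azure OpenAI call shown earlier and are placeholders:

// Timeout + fallback sketch for external API calls (Node 18+)
async function callWithTimeout(url, apiKey, body, timeoutMs = 10000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'api-key': apiKey },
      body: JSON.stringify(body),
      signal: controller.signal
    });
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return await response.json();
  } catch (error) {
    // Degraded mode: return a stub response so the report still gets generated
    return {
      choices: [{ message: { content: `AI analysis skipped (${error.message}). Raw metrics attached below.` } }],
      degraded: true
    };
  } finally {
    clearTimeout(timer);
  }
}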

Token usage visibility drives optimization: Monitoring token consumption in real-time helped optimize prompt design and understand the resource implications of different analysis frequencies. This transparency enabled informed decisions about monitoring granularity versus AI resource usage.
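One way to extend the per-report tracking shown earlier into a running total is to stash counters in workflow static data. This is a sketch that assumes n8n's $getWorkflowStaticData helper is available in the Code node; static data persists between production executions, not manual test runs:

// n8n Code node sketch - running token totals across executions
const usage = $input.first().json.usage || {};
const tokensThisRun = usage.total_tokens || 0;

const stats = $getWorkflowStaticData('global');
stats.totalTokens = (stats.totalTokens || 0) + tokensThisRun;
stats.runs = (stats.runs || 0) + 1;

return {
  json: {
    tokens_this_run: tokensThisRun,
    total_tokens_to_date: stats.totalTokens,
    avg_tokens_per_run: Math.round(stats.totalTokens / stats.runs)
  }
};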


🛠️ Common Issues and Solutions

Based on practical experience running this system, here are the most frequent challenges and their resolutions:

Prometheus Connection Issues

Symptom: Targets showing as "DOWN" in Prometheus interface

# Check Prometheus targets status
curl http://localhost:9090/api/v1/targets

# Verify container networking
docker network inspect ai-monitoring-lab_monitoring

# Check if services can reach each other
docker exec prometheus ping node-exporter

n8n Workflow Execution Failures

Symptom: HTTP 500 errors in workflow execution logs

// Add comprehensive error handling to JavaScript nodes
try {
  const result = processMetrics(input);
  return { json: result };
} catch (error) {
  console.error('Processing failed:', error);
  return { 
    json: { 
      error: true, 
      message: error.message,
      timestamp: new Date().toISOString()
    } 
  };
}

Azure OpenAI Rate Limiting

Symptom: Sporadic HTTP 429 errors

// Implement exponential backoff for API calls
async function retryWithBackoff(apiCall, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await apiCall();
    } catch (error) {
      if (error.status === 429 || error.status >= 500) {
        const backoffDelay = Math.min(1000 * Math.pow(2, attempt), 30000);
        await new Promise(resolve => setTimeout(resolve, backoffDelay));
      } else {
        throw error;
      }
    }
  }
  throw new Error(`API call failed after ${maxRetries} attempts`);
}

Memory Management on Raspberry Pi

Symptom: System becomes unresponsive under load

# Add memory limits to docker-compose.yml
services:
  prometheus:
    mem_limit: 1g
    mem_reservation: 512m
  grafana:
    mem_limit: 512m
    mem_reservation: 256m

🎯 What's Next After This Project?

Having successfully built an AI-powered monitoring system in your home lab, you've developed transferable skills for larger infrastructure projects:

Immediate Next Steps:

  • Apply these patterns to cloud infrastructure (AWS EC2, Azure VMs)
  • Expand monitoring to cover application metrics, not just system metrics
  • Explore other AI models and prompt engineering techniques

Future Learning Projects:

  • Security Event Analysis: Use similar AI integration patterns for log analysis
  • Cost Optimization: Apply AI analysis to cloud billing and usage data
  • Capacity Planning: Extend monitoring for predictive resource planning

Skills You Can Now Confidently Apply:

  • Integrating AI services with traditional monitoring tools
  • Creating business-relevant reports from technical metrics
  • Building automated workflows for infrastructure management
  • Designing scalable monitoring architectures

📚 Additional Resources

🛠️ Official Documentation

Container and Orchestration:

Monitoring and Observability:

Workflow Automation:

AI Integration:

🎓 Learning Resources

Docker and Containerization:

Monitoring Fundamentals:

AI and Prompt Engineering:


🎯 Conclusion: AI-Enhanced Monitoring Success

This home lab project successfully demonstrates how AI can transform traditional infrastructure monitoring from simple threshold alerts into intelligent, actionable insights. The structured approach—from Docker fundamentals through AI integration—provides a practical learning path for developing production-ready skills in a cost-effective environment.

Key Achievements:

  • Technical Integration: Successfully combined Prometheus, Grafana, n8n, and Azure OpenAI into a cohesive monitoring system
  • AI Prompt Engineering: Developed context-specific prompts that transform raw metrics into business-relevant insights
  • Professional Communication: Created stakeholder-ready reports that bridge technical data and business impact
  • Cost-Conscious Development: Leveraged subscription benefits for extensive AI experimentation

Most Valuable Insights:

  • AI analysis quality depends on data structure and prompt engineering - generic prompts produce generic insights
  • Visual workflow tools dramatically reduce development complexity while maintaining flexibility
  • Home lab experimentation provides a safe environment for expensive AI service optimization
  • Business context and stakeholder communication are as important as technical implementation

Professional Development Impact: The patterns learned in this project—intelligent data collection, contextual analysis, and automated communication—scale directly to enterprise monitoring requirements. For infrastructure professionals exploring AI integration, this home lab approach provides hands-on experience with real tools and challenges that translate to immediate career value.

The investment in learning these integration patterns delivers improved monitoring effectiveness, reduced alert noise, and enhanced stakeholder communication—essential skills for modern infrastructure teams working with AI-augmented systems.



🤖 First Steps into AI Automation: My Journey from Trial to Self-Hosted Chaos

What started as 'let me just automate some emails' somehow turned into a comprehensive exploration of every AI automation platform and deployment method known to mankind...

After months of reading about AI automation tools and watching everyone else's productivity skyrocket with clever workflows, I finally decided to stop being a spectator and dive in myself. What started as a simple "let's automate job alert emails" experiment quickly became a week-long journey through cloud trials, self-hosted deployments, OAuth authentication battles, and enough Docker containers to power a small data centre.

In this post, you'll discover:

  • Real costs of AI automation experimentation ($10-50 range)
  • Why self-hosted OAuth2 is significantly harder than cloud versions
  • Performance differences: Pi 5 vs. desktop hardware for local AI
  • When to choose local vs. cloud AI models
  • Time investment reality: ~10 hours over 1 week for this project

Here's how my first real foray into AI automation unfolded — spoiler alert: it involved more container migrations than I initially planned.

Hardware baseline for this project:

💻 Development Environment

  • Primary machine: AMD Ryzen 7 5800X, 32GB DDR4, NVMe SSD
  • Pi 5 setup: 8GB RAM, microSD storage
  • Network: Standard home broadband (important for cloud API performance)

🎯 The Mission: Taming Job Alert Email Chaos

Let's set the scene. If you're drowning in recruitment emails like I was, spending 30+ minutes daily parsing through multiple job listings scattered across different emails, you'll understand the frustration. Each recruitment platform has its own format, some emails contain 5-10 different opportunities, and manually extracting the relevant URLs was becoming a productivity killer.

The vision: Create an automated workflow that would:

  • Scrape job-related emails from my Outlook.com inbox
  • Extract and clean the job data using AI
  • Generate a neat summary email with all the job URLs in one place
  • Send it back to me in a digestible format

Simple enough, right? Famous last words.


🔄 Phase 1: The n8n Cloud Trial Adventure

My research pointed to n8n as the go-to tool for this kind of automation workflow. Being sensible, I started with their 14-day cloud trial rather than jumping straight into self-hosting complexities.

⚙️ Initial Setup & First Success

The n8n cloud interface is genuinely impressive — drag-and-drop workflow building with a proper visual editor that actually makes sense. Within a couple of hours, I had:

  • Connected to Outlook.com via their built-in connector
  • Set up email filtering to grab job-related messages
  • Configured basic data processing to extract text content
  • Integrated OpenAI API for intelligent job URL extraction

n8n jobs workflow with OpenAI API integration

🤖 The AI Integration Challenge

This is where things got interesting. Initially, I connected the workflow to my OpenAI API account, using GPT-4 to parse email content and extract job URLs. The AI component worked brilliantly — almost too well, since I managed to burn through my $10 worth of token credits in just two days of testing.

The cost reality: Those "just testing a few prompts" sessions add up fast. A single complex email with multiple job listings processed through GPT-4 was costing around $0.15-0.30 per API call. When you're iterating on prompts and testing edge cases, those costs compound quickly.
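For anyone wanting to sanity-check their own experiments, here's a rough back-of-the-envelope calculation. The per-token rates are assumptions based on GPT-4 list pricing at the time, so swap in whatever your provider actually charges:

// Back-of-the-envelope GPT-4 cost per email (rates are assumptions, not quoted prices)
const PROMPT_RATE = 0.03 / 1000;      // ~$0.03 per 1K prompt tokens
const COMPLETION_RATE = 0.06 / 1000;  // ~$0.06 per 1K completion tokens

function estimateCostUSD(promptTokens, completionTokens) {
  return promptTokens * PROMPT_RATE + completionTokens * COMPLETION_RATE;
}

// A long recruitment email plus instructions ~4,000 prompt tokens, structured reply ~800 tokens
console.log(`$${estimateCostUSD(4000, 800).toFixed(2)} per call`); // roughly $0.17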

Lesson learned: Test with smaller models first, then scale up. GPT-4 is excellent but not cheap for experimental workflows.

🎯 Partial Success (The Classic IT Story)

The workflow was partially successful — and in true IT fashion, "partially" is doing some heavy lifting here. While the automation successfully processed emails and generated summaries, it had one glaring limitation: it only extracted one job URL per email, when most recruitment emails contain multiple opportunities.

What this actually meant: A typical recruitment email might contain 5-7 job listings with individual URLs, but my workflow would only capture the first one it encountered. This wasn't a parsing issue — the AI was correctly identifying all the URLs in its response, but the n8n workflow was only processing the first result from the AI output.

Why this limitation exists: The issue stemmed from how I'd configured the data processing nodes in n8n. The workflow was treating the AI response as a single data item rather than an array of multiple URLs. This is a common beginner mistake when working with structured data outputs.
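With hindsight, the fix is to make the Code node return an array of items rather than a single item. This is a sketch of that approach rather than the node I actually shipped; it assumes the AI has been prompted to reply with a JSON array of {title, url} objects:

// n8n Code node sketch - emit one workflow item per job URL
const raw = $input.first().json.choices[0].message.content;

let jobs;
try {
  jobs = JSON.parse(raw); // e.g. [{ "title": "Platform Engineer", "url": "https://..." }, ...]
} catch (e) {
  // Fallback: pull anything URL-shaped out of a free-text response
  jobs = (raw.match(/https?:\/\/\S+/g) || []).map(url => ({ title: 'Unknown role', url }));
}

// Returning an array of { json } objects produces multiple items,
// so downstream nodes run once per job instead of once per email
return jobs.map(job => ({ json: job }));

Each returned object becomes its own workflow item, so a follow-up node (email formatter, deduplication, whatever you need) runs once per job listing instead of once per email.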

This became the recurring theme of my experimentation week: everything works, just not quite how you want it to.

💡 Enter Azure OpenAI

Rather than continue burning through OpenAI credits, I pivoted to Azure OpenAI. This turned out to be a smart move for several reasons:

  • Cost control: Better integration with my existing Azure credits
  • Familiar environment: Already comfortable with Azure resource management
  • Testing flexibility: My Visual Studio Developer subscription gives me £120 monthly credits

I deployed a GPT-4o mini model in my test lab Azure tenant — perfect for experimentation without breaking the bank.

Azure OpenAI GPT-4o mini deployment configuration

The Azure OpenAI integration worked seamlessly with n8n, and I successfully redirected my workflow to use the new endpoint. Finally, something that worked first time.

n8n jobs workflow with Azure OpenAI integration


🐳 Phase 2: Self-Hosting Ambitions (Container Edition #1)

With the n8n cloud trial clock ticking down, I faced the classic build-vs-buy decision. The cloud version was excellent, but I wanted full control and the ability to experiment without subscription constraints. The monthly $20 cost wasn't prohibitive, but the learning opportunity of self-hosting was too appealing to pass up.

Enter self-hosting with Docker containers — specifically, targeting my Raspberry Pi 5 setup.

🏠 The OpenMediaVault Experiment

My first attempt involved deploying n8n as a self-hosted Docker container on my OpenMediaVault (OMV) setup. For those unfamiliar, OMV is a network-attached storage (NAS) solution built on Debian, perfect for home lab environments where you want proper storage management with container capabilities.

Why the Pi 5 + OMV route:

  • Always-on availability: Unlike my main PC, the Pi runs 24/7
  • Low power consumption: Perfect for continuous automation workflows
  • Storage integration: OMV provides excellent Docker volume management
  • Learning opportunity: Understanding self-hosted deployment challenges

The setup:

  • Host: Raspberry Pi 5 running OpenMediaVault
  • Backend storage: NAS device for persistent data
  • Database: PostgreSQL container for n8n's backend
  • Edition: n8n Community Edition (self-hosted)

OpenMediaVault Docker container management interface

😤 The Great OAuth Authentication Battle

This is where my self-hosting dreams met reality with a resounding thud.

I quickly discovered that replicating my cloud workflow wasn't going to be straightforward. The self-hosted community edition has functionality restrictions compared to the cloud version, but more frustratingly, I couldn't get OAuth2 authentication working properly.

Why OAuth2 is trickier with self-hosted setups:

  • Redirect URI complexity: Cloud services handle callback URLs automatically, but self-hosted instances need manually configured redirect URIs
  • App registration headaches: Azure app registrations expect specific callback patterns that don't align well with dynamic self-hosted URLs
  • Token management: Cloud versions handle OAuth token refresh automatically; self-hosted requires manual configuration
  • Security certificate requirements: Many OAuth providers now require HTTPS callbacks, adding SSL certificate management complexity

The specific challenges I hit:

  • Outlook.com authentication: Couldn't configure OAuth2 credentials using an app registration from my test lab Azure tenant
  • Exchange Online integration: Also failed to connect via app registration — kept getting "invalid redirect URI" errors
  • Documentation gaps: Self-hosting authentication setup felt less polished than the cloud version

After several hours over two days debugging OAuth flows and Azure app registrations, I admitted defeat on the email integration front. Sometimes retreat is the better part of valour.

🌤️ Simple Success: Weather API Workflow

Rather than abandon the entire self-hosting experiment, I pivoted to a simpler proof-of-concept. I created a basic workflow using:

  • OpenWeatherMap API for weather data
  • Gmail integration with app passwords (much simpler than OAuth2)
  • Basic data processing and email generation

This worked perfectly and proved that the self-hosted n8n environment was functional — the issue was specifically with the more complex authentication requirements of my original workflow.
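For anyone recreating it, the core data-collection step looks roughly like this. It's a sketch assuming the standard OpenWeatherMap current-weather endpoint and an API key held in an environment variable, not my exact n8n node configuration:

// Minimal weather collection step (Node 18+), OpenWeatherMap current-weather API
const API_KEY = process.env.OPENWEATHER_API_KEY; // placeholder

async function getWeatherSummary(city = 'London') {
  const url = `https://api.openweathermap.org/data/2.5/weather?q=${encodeURIComponent(city)}&units=metric&appid=${API_KEY}`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`OpenWeatherMap returned HTTP ${res.status}`);
  const data = await res.json();
  return `${city}: ${data.weather[0].description}, ${Math.round(data.main.temp)}°C`;
}

getWeatherSummary().then(console.log).catch(console.error);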

Simple n8n weather workflow using OpenAI API


🐳 Phase 3: The WSL Migration (Container Migration #2)

While the Pi 5 setup was working fine for simple workflows, I started feeling the hardware limitations when testing more complex operations. Loading even smaller AI models was painfully slow, and memory constraints meant I couldn't experiment with anything approaching production-scale workflows.

Time for Container Migration #2.

🖥️ Moving to WSL + Docker Desktop

With the Pi 5 hitting performance limits, I decided to experiment with local AI models using Ollama (a local LLM hosting platform) and OpenWebUI (a web interface for interacting with AI models). This required more computational resources than the Pi could provide, so I deployed these tools using Docker Compose inside Ubuntu running on Windows WSL (Windows Subsystem for Linux).

This setup offered several advantages:

Why WSL over the Pi 5:

  • Better hardware resources: Access to my Windows PC's 32GB RAM and 8-core CPU vs. Pi 5's 8GB RAM limitation
  • Docker Desktop integration: Visual container management through familiar interface
  • Development flexibility: Easier to iterate and debug workflows with full IDE access
  • Performance reality: Local LLM model loading went from 1+ minutes on Pi 5 to under 30 seconds

My development machine specs:

  • CPU: AMD Ryzen 7 5800H with Radeon Graphics
  • RAM: 32GB DDR4
  • Storage: NVMe SSD for fast model loading
  • GPU: None (pure CPU inference)

Time Investment Reality:

  • n8n cloud setup: 2-3 hours (including initial workflow creation)
  • OAuth2 debugging: 3+ hours over 2 days (ongoing challenge)
  • Pi 5 container setup: 2+ hours
  • Docker Desktop container setup: 2+ hours
  • Total project time: ~10 hours over 1 week

The new stack:

  • Host: Ubuntu in WSL2 on Windows
  • Container orchestration: Docker Compose
  • Management: Docker Desktop for Windows
  • Models: Ollama for local LLM hosting
  • Interface: OpenWebUI for model interaction

Docker Desktop showing Ollama containers running

🧠 Local LLM Experimentation

This is where the project took an interesting turn. Rather than continuing to rely on cloud APIs, I started experimenting with local language models through Ollama.

Why local LLMs?

  • Cost control: No per-token charges for experimentation
  • Privacy: Sensitive data stays on local infrastructure
  • Learning opportunity: Understanding how different models perform

The Docker Compose setup made it trivial to spin up different model combinations and test their performance on my email processing use case.

⚠️ Reality Check: Local vs. Cloud Performance

Let's be honest here — using an LLM locally is never going to be a fully featured replacement for the likes of ChatGPT or Claude. This became apparent pretty quickly during my testing.

Performance realities:

  • Speed: Unless you're running some serious hardware, the performance will be a lot slower than the online AI counterparts
  • Model capabilities: Local models (especially smaller ones that run on consumer hardware) lack the sophisticated reasoning of GPT-4 or Claude
  • Resource constraints: My standard PC setup meant I was limited to smaller model variants
  • Response quality: Noticeably less nuanced and accurate responses compared to cloud services

Where local LLMs do shine:

  • Privacy-sensitive tasks: When you can't send data to external APIs
  • Development and testing: Iterating on prompts without burning through API credits
  • Learning and experimentation: Understanding how different model architectures behave
  • Offline scenarios: When internet connectivity is unreliable

The key insight: local LLMs are a complement to cloud services, not a replacement. Use them when privacy, cost, or learning are the primary concerns, but stick with cloud APIs when you need reliable, high-quality results.

🔗 Hybrid Approach: Best of Both Worlds

The final configuration became a hybrid approach:

  • OpenWebUI connected to Azure OpenAI for production-quality responses
  • Local Ollama models for development and privacy-sensitive testing
  • Docker containers exposed through Docker Desktop for easy management

This gave me the flexibility to choose the right tool for each task — cloud APIs when I need reliability and performance, local models when I want to experiment or maintain privacy.
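A rough sketch of what that routing decision can look like in code; the Ollama URL is its default local API, the Azure URL is the same placeholder format used earlier, and the decision flags are invented purely for illustration:

// Pick a model endpoint per request: local Ollama for private or experimental work,
// Azure OpenAI for production-quality answers (URLs are placeholders)
const AZURE_URL = 'https://YOUR_RESOURCE.openai.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-02-15-preview';
const OLLAMA_URL = 'http://localhost:11434/api/chat'; // Ollama's default local API

function pickEndpoint({ sensitiveData = false, experiment = false }) {
  if (sensitiveData || experiment) {
    // Keep the data on the local network and avoid per-token charges
    return { url: OLLAMA_URL, headers: { 'Content-Type': 'application/json' } };
  }
  return {
    url: AZURE_URL,
    headers: { 'Content-Type': 'application/json', 'api-key': process.env.AZURE_OPENAI_KEY }
  };
}

console.log(pickEndpoint({ sensitiveData: true }).url); // http://localhost:11434/api/chat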

OpenWebUI local interface with model selection

💰 Cost Reality Check

After a week of experimentation, here's how the costs actually broke down:

  • n8n Cloud: 14-day free trial, then €20/month; my usage: 2 weeks of testing; note: full OAuth2 features
  • OpenAI API: pay-per-use, variable cost; my usage: $10 in 2 days; note: expensive for testing
  • Azure OpenAI: free credits (£120/month budget); my usage: ~£15 used; note: better for experimentation
  • Self-hosted: free apart from hardware and time; my usage: 2 days of setup; note: OAuth2 complexity

Key insight: The "free" self-hosted option came with a significant time cost — debugging authentication issues for hours vs. having things work immediately in the cloud version.


📊 Current State: Lessons Learned & Next Steps

After a week of container deployments, OAuth battles, and API integrations, here's where I've landed:

✅ What's Working Well

Technical Stack:

  • n8n self-hosted: Currently running 2 active workflows (weather alerts, basic data processing)
  • Azure OpenAI integration: Reliable and cost-effective for AI processing — saving ~£25/month vs. direct OpenAI API
  • Docker containerisation: Easy deployment and management across different environments
  • WSL environment: 10x performance improvement over Pi 5 for local AI model loading

Process Improvements:

  • Iterative approach: Start simple, add complexity gradually — this saved significant debugging time
  • Hybrid cloud/local strategy: Use the right tool for each requirement rather than forcing one solution
  • Container flexibility: Easy to migrate and scale across different hosts when hardware constraints appear

Daily productivity impact: While the original job email automation isn't fully solved, the weather automation saves ~10 minutes daily, and the learning has already paid dividends in other automation projects.

⚠️ Ongoing Challenges (The Work-in-Progress List)

Authentication Issues:

  • OAuth2 integration with Outlook.com/Exchange Online still unresolved
  • Need to explore alternative authentication methods or different email providers
  • May require diving deeper into Azure app registration configurations

Workflow Limitations:

  • Original job email processing goal partially achieved but needs refinement
  • Multiple job URL extraction per email still needs work
  • Error handling and retry logic need improvement

Infrastructure Decisions:

  • Balancing local vs. cloud resources for different use cases
  • Determining optimal Docker deployment strategy for production workflows
  • Managing costs across multiple AI service providers

Decision-making process during failures: When something doesn't work, I typically: (1) Troubleshoot the exact error using ChatGPT or Anthropic Claude, (2) Search for similar issues in community forums, (3) Try a simpler alternative approach, (4) If still stuck after 2-3 hours, pivot to a different method rather than continuing to debug indefinitely.

🚀 Next Steps & Future Experiments

Short-term goals (next 2-4 weeks):

  1. Resolve OAuth2 authentication for proper email integration
  2. Improve job URL extraction accuracy — tackle the multiple URLs per email challenge
  3. Add error handling and logging to existing workflows
  4. Explore alternative email providers if Outlook.com integration remains problematic

Medium-term exploration (next 2-3 months):

  1. Local LLM performance tuning for specific use cases
  2. Workflow templates for common automation patterns
  3. Integration with other productivity tools (calendar, task management)
  4. Monitoring and alerting for automated workflows

🛠️ Quick Wins for Beginners

If you're just starting your AI automation journey, here are the lessons learned that could save you time:

🎯 Start Simple First

  • Begin with n8n cloud trial to understand the platform without authentication headaches
  • Use simple APIs (weather, RSS feeds) before tackling complex ones (email OAuth2)
  • Test with smaller AI models before jumping to GPT-4

💡 Budget for Experimentation

  • Set aside $20-50 for API testing — it goes faster than you think
  • Azure OpenAI credits can be more cost-effective than direct OpenAI API for learning
  • Factor in time costs when choosing self-hosted vs. cloud solutions

🔧 Have Fallback Options Ready

  • Plan alternative authentication methods (app passwords vs. OAuth2)
  • Keep both cloud and local AI options available
  • Document what works and what doesn't for future reference

🔧 Technical Resources & Documentation

For anyone inspired to start their own AI automation journey, here are the key resources that proved invaluable:

🛠️ Core Tools & Platforms

  • n8n — Visual workflow automation platform
  • Docker — Containerisation platform
  • Docker Compose — Multi-container orchestration tool
  • OpenMediaVault — NAS/storage management solution

🤖 AI & LLM Resources

📚 Setup Guides & Documentation

🔧 Troubleshooting Common Issues

Based on my week of trial and error, here are the most common problems you'll likely encounter:

🔐 OAuth2 Authentication Failures

Symptoms: "Invalid redirect URI" or "Authentication failed" errors when connecting to email services.

Likely causes:

  • Redirect URI mismatch between app registration and n8n configuration
  • Self-hosted instance not using HTTPS for callbacks
  • App registration missing required API permissions

Solutions to try:

  • Use app passwords instead of OAuth2 where possible (Gmail, Outlook.com) — Note: App passwords are simpler username/password credentials that bypass OAuth2 complexity but offer less security
  • Ensure your n8n instance is accessible via HTTPS with valid SSL certificate
  • Double-check app registration redirect URIs match exactly (including trailing slashes)
  • Start with cloud trial to verify workflow logic before self-hosting

🐳 Container Performance Issues

Symptoms: Slow model loading, container crashes, high memory usage.

Likely causes:

  • Insufficient RAM allocation to Docker
  • CPU-intensive models running on inadequate hardware
  • Competing containers for limited resources

Solutions to try:

  • Increase Docker memory limits in Docker Desktop settings
  • Use smaller model variants (7B instead of 13B+ parameters)
  • Monitor resource usage with docker stats command
  • Consider migrating from Pi to x86 hardware for better performance

💸 API Rate Limiting and Costs

Symptoms: API calls failing, unexpected high costs, token limits exceeded.

Likely causes:

  • Testing with expensive models (GPT-4) instead of cheaper alternatives
  • No rate limiting in workflow configurations
  • Inefficient prompt design causing high token usage

Solutions to try:

  • Start testing with GPT-3.5 Turbo or GPT-4o mini models
  • Implement workflow rate limiting and retry logic
  • Optimize prompts to reduce token consumption
  • Set API spending alerts in provider dashboards

💻 Resource Requirements Summary

Minimum Requirements for Recreation:

  • Cloud approach: n8n trial account + $20-50 API experimentation budget
  • Self-hosted approach: 8GB+ RAM, Docker knowledge, 2-3 days setup time
  • Local AI experimentation: 16GB+ RAM recommended, considerable patience, NVMe storage preferred
  • Network: Stable broadband connection for cloud API performance

💭 Final Thoughts: The Joy of Controlled Chaos

What started as a simple email automation project became a comprehensive exploration of modern AI automation tools. While I didn't achieve my original goal completely (yet), the journey provided invaluable hands-on experience with:

  • Container orchestration across different environments
  • AI service integration patterns and best practices
  • Authentication complexity in self-hosted vs. cloud environments
  • Hybrid deployment strategies for flexibility and cost control

The beauty of this approach is that each "failed" experiment taught me something valuable about the tools and processes involved. The OAuth2 authentication issues, while frustrating, highlighted the importance of proper authentication design. The container migrations demonstrated the flexibility of modern deployment approaches.

Most importantly: I now have a functional foundation for AI automation experiments, with both cloud and local capabilities at my disposal.

Is it overengineered for a simple email processing task? Absolutely. Was it worth the learning experience? Without question.

Have you tackled similar AI automation projects? I'd particularly love to hear from anyone who's solved the OAuth2 self-hosting puzzle or found creative workarounds for email processing limitations. Drop me a line if you've found better approaches to any of these challenges.


📸 Image Requirements Summary

For anyone recreating this setup, here are the key screenshots included in this post:

  1. n8n-jobs-workflow-openai.png — Original workflow using direct OpenAI API (the expensive version that burned through $10 in 2 days)
  2. azure-openai-deployment.png — Azure OpenAI Studio showing GPT-4o mini deployment configuration
  3. n8n-jobs-workflow-azure.png — Improved workflow using Azure OpenAI integration (the cost-effective version)
  4. omv-docker-n8n-containers.png — OpenMediaVault interface showing Docker container management on Pi 5
  5. n8n-weather-workflow.png — Simple weather API to Gmail workflow demonstrating successful self-hosted setup
  6. docker-desktop-ollama.png — Docker Desktop showing Ollama and OpenWebUI containers running on WSL
  7. openwebui-local.png — OpenWebUI interface showing both Azure OpenAI and local model selection options

Each image demonstrates the practical implementation rather than theoretical concepts, helping readers visualize the actual tools and interfaces involved in the automation journey.


🔄 Bringing Patch Management In-House: Migrating from MSP to Azure Update Manager

It's all fun and games until the MSP contract expires and you realise 90 VMs still need their patching schedules sorted…

With our MSP contract winding down, the time had come to bring VM patching back in house. Our third-party provider had been handling it with their own tooling, which would no longer be used when the service contract expired.

Enter Azure Update Manager — the modern, agentless way to manage patching schedules across your Azure VMs. Add a bit of PowerShell, sprinkle in some Azure Policy, and you've got yourself a scalable, policy-driven solution that's more visible, auditable, and way more maintainable.

Here's how I made the switch — and managed to avoid a patching panic.


⚙️ Prerequisites & Permissions

Let's get the plumbing sorted before diving in.

You'll need:

  • The right PowerShell modules:
Install-Module Az -Scope CurrentUser -Force
Import-Module Az.Maintenance, Az.Resources, Az.Compute
  • An account with Contributor permissions (or higher)
  • Registered providers to avoid mysterious error messages:
Register-AzResourceProvider -ProviderNamespace Microsoft.Maintenance
Register-AzResourceProvider -ProviderNamespace Microsoft.GuestConfiguration

Why Resource Providers? Azure Update Manager needs these registered to create the necessary API endpoints and resource types in your subscription. Without them, you'll get cryptic "resource type not found" errors.

Official documentation on Azure Update Manager prerequisites


🕵️ Step 1 – Audit the Current Setup

First order of business: collect the patching summary data from the MSP — which, helpfully, came in the form of multiple weekly CSV exports.

I used GenAI to wrangle the mess into a structured format. The result was a clear categorisation of VMs based on the day and time they were typically patched — a solid foundation to work from.


🧱 Step 2 – Create Seven New Maintenance Configurations

This is the foundation of Update Manager — define your recurring patch windows.

Click to expand: Create Maintenance Configurations (Sample Script)
# Azure Update Manager - Create Weekly Maintenance Configurations
# Pure PowerShell syntax

# Define parameters
$resourceGroupName = "rg-maintenance-uksouth-001"
$location = "uksouth"
$timezone = "GMT Standard Time"
$startDateTime = "2024-06-01 21:00"
$duration = "03:00"  # 3 hours - meets minimum requirement

# Day mapping for config naming (3-letter lowercase)
$dayMap = @{
    "Monday"    = "mon"
    "Tuesday"   = "tue" 
    "Wednesday" = "wed"
    "Thursday"  = "thu"
    "Friday"    = "fri"
    "Saturday"  = "sat"
    "Sunday"    = "sun"
}

# Create maintenance configurations for each day
foreach ($day in $dayMap.Keys) {
    $shortDay = $dayMap[$day]
    $configName = "contoso-maintenance-config-vms-$shortDay"

    Write-Host "Creating: $configName for $day..." -ForegroundColor Yellow

    try {
        $result = New-AzMaintenanceConfiguration `
            -ResourceGroupName $resourceGroupName `
            -Name $configName `
            -MaintenanceScope "InGuestPatch" `
            -Location $location `
            -StartDateTime $startDateTime `
            -Timezone $timezone `
            -Duration $duration `
            -RecurEvery "Week $day" `
            -InstallPatchRebootSetting "IfRequired" `
            -ExtensionProperty @{"InGuestPatchMode" = "User"} `
            -WindowParameterClassificationToInclude @("Critical", "Security") `
            -LinuxParameterClassificationToInclude @("Critical", "Security") `
            -Tag @{
                "Application"  = "Azure Update Manager"
                "Owner"        = "Contoso"
                "PatchWindow"  = $shortDay
            } `
            -ErrorAction Stop

        Write-Host "✓ SUCCESS: $configName" -ForegroundColor Green

        # Quick validation
        $createdConfig = Get-AzMaintenanceConfiguration -ResourceGroupName $resourceGroupName -Name $configName
        Write-Host "  Validated: $($createdConfig.RecurEvery) schedule confirmed" -ForegroundColor Gray

    } catch {
        Write-Host "✗ FAILED: $configName - $($_.Exception.Message)" -ForegroundColor Red
        continue
    }
}

⚠️ Don't forget: the Duration value uses the HH:mm format shown above (e.g. "03:00" for a three-hour window), not free text like "2 hours", and the start date/time has to fall on the day the schedule recurs on.

Learn more about New-AzMaintenanceConfiguration


🛠️ Step 3 – Tweak the Maintenance Configs

Some patch windows felt too tight — and, just as importantly, I needed to avoid overlaps with existing backup jobs. Rather than let a large CU fail halfway through or run headlong into an Azure Backup job, I extended the duration on select configs and staggered them across the week:

$config = Get-AzMaintenanceConfiguration -ResourceGroupName "rg-maintenance-uksouth-001" -Name "contoso-maintenance-config-vms-sun"
$config.Duration = "04:00"
Update-AzMaintenanceConfiguration -ResourceGroupName "rg-maintenance-uksouth-001" -Name "contoso-maintenance-config-vms-sun" -Configuration $config

# Verify the change
$updatedConfig = Get-AzMaintenanceConfiguration -ResourceGroupName "rg-maintenance-uksouth-001" -Name "contoso-maintenance-config-vms-sun"
Write-Host "Sunday window now: $($updatedConfig.Duration) duration" -ForegroundColor Green

Learn more about Update-AzMaintenanceConfiguration


🤖 Step 4 – Use AI to Group VMs by Patch Activity

Armed with CSV exports of the latest patching summaries, I got AI to do the grunt work and make sense of the contents.

What I did:

  1. Exported MSP data: Weekly CSV reports showing patch installation timestamps for each VM
  2. Used Gen AI with various iterative prompts, starting the conversation with this:

    "Attached is an export summary of the current patching activity from our incumbent MSP who currently look after the patching of the VM's in Azure I need you to review timestamps and work out which maintenance window each vm is currently in, and then match that to the appropriate maintenance config that we have just created. If there are mis matches in new and current schedule then we may need to tweak the settings of the new configs"

  3. AI analysis revealed:

  • 60% of VMs were patching on one weekday evening
  • Several critical systems patching simultaneously
  • No consideration for application dependencies

  4. AI recommendation: Spread VMs across weekdays based on:

  • Criticality: Domain controllers on different days
  • Function: Similar servers on different days (avoid single points of failure)
  • Dependencies: Database servers before application servers

The result: A logical rebalancing that avoided putting all our eggs in the "Sunday 1AM" basket and took business impact into account.

Why this matters: The current patching schedule was not optimized for business continuity. AI helped identify risks we hadn't considered.


🔍 Step 5 – Discover All VMs and Identify Gaps

Before diving into bulk tagging, I needed to understand what we were working with across all subscriptions.

First, let's see what VMs we have:

Click to expand: Discover Untagged VMs (Sample Script)
# Discover Untagged VMs Script for Azure Update Manager
# This script identifies VMs that are missing Azure Update Manager tags

$scriptStart = Get-Date

Write-Host "=== Azure Update Manager - Discover Untagged VMs ===" -ForegroundColor Cyan
Write-Host "Scanning all accessible subscriptions for VMs missing maintenance tags..." -ForegroundColor White
Write-Host ""

# Function to check if VM has Azure Update Manager tags
function Test-VMHasMaintenanceTags {
    param($VM)

    # Check for the three required tags
    $hasOwnerTag = $VM.Tags -and $VM.Tags.ContainsKey("Owner") -and $VM.Tags["Owner"] -eq "Contoso"
    $hasUpdatesTag = $VM.Tags -and $VM.Tags.ContainsKey("Updates") -and $VM.Tags["Updates"] -eq "Azure Update Manager"
    $hasPatchWindowTag = $VM.Tags -and $VM.Tags.ContainsKey("PatchWindow")

    return $hasOwnerTag -and $hasUpdatesTag -and $hasPatchWindowTag
}

# Function to get VM details for reporting
function Get-VMDetails {
    param($VM, $SubscriptionName)

    return [PSCustomObject]@{
        Name = $VM.Name
        ResourceGroup = $VM.ResourceGroupName
        Location = $VM.Location
        Subscription = $SubscriptionName
        SubscriptionId = $VM.SubscriptionId
        PowerState = $VM.PowerState
        OsType = $VM.StorageProfile.OsDisk.OsType
        VmSize = $VM.HardwareProfile.VmSize
        Tags = if ($VM.Tags) { ($VM.Tags.Keys | ForEach-Object { "$_=$($VM.Tags[$_])" }) -join "; " } else { "No tags" }
    }
}

# Initialize collections
$taggedVMs = @()
$untaggedVMs = @()
$allVMs = @()
$subscriptionSummary = @{}

Write-Host "=== DISCOVERING VMs ACROSS ALL SUBSCRIPTIONS ===" -ForegroundColor Cyan

# Get all accessible subscriptions
$subscriptions = Get-AzSubscription | Where-Object { $_.State -eq "Enabled" }
Write-Host "Found $($subscriptions.Count) accessible subscriptions" -ForegroundColor White

foreach ($subscription in $subscriptions) {
    try {
        Write-Host "`nScanning subscription: $($subscription.Name) ($($subscription.Id))" -ForegroundColor Magenta
        $null = Set-AzContext -SubscriptionId $subscription.Id -ErrorAction Stop

        # Get all VMs in this subscription
        Write-Host "  Retrieving VMs..." -ForegroundColor Gray
        $vms = Get-AzVM -Status -ErrorAction Continue

        $subTagged = 0
        $subUntagged = 0
        $subTotal = $vms.Count

        Write-Host "  Found $subTotal VMs in this subscription" -ForegroundColor White

        foreach ($vm in $vms) {
            $vmDetails = Get-VMDetails -VM $vm -SubscriptionName $subscription.Name
            $allVMs += $vmDetails

            if (Test-VMHasMaintenanceTags -VM $vm) {
                $taggedVMs += $vmDetails
                $subTagged++
                Write-Host "    ✓ Tagged: $($vm.Name)" -ForegroundColor Green
            } else {
                $untaggedVMs += $vmDetails
                $subUntagged++
                Write-Host "    ⚠️ Untagged: $($vm.Name)" -ForegroundColor Yellow
            }
        }

        # Store subscription summary
        $subscriptionSummary[$subscription.Name] = @{
            Total = $subTotal
            Tagged = $subTagged
            Untagged = $subUntagged
            SubscriptionId = $subscription.Id
        }

        Write-Host "  Subscription Summary - Total: $subTotal | Tagged: $subTagged | Untagged: $subUntagged" -ForegroundColor Gray

    }
    catch {
        Write-Host "  ✗ Error scanning subscription $($subscription.Name): $($_.Exception.Message)" -ForegroundColor Red
        $subscriptionSummary[$subscription.Name] = @{
            Total = 0
            Tagged = 0
            Untagged = 0
            Error = $_.Exception.Message
        }
    }
}

Write-Host ""
Write-Host "=== OVERALL DISCOVERY SUMMARY ===" -ForegroundColor Cyan
Write-Host "Total VMs found: $($allVMs.Count)" -ForegroundColor White
Write-Host "VMs with maintenance tags: $($taggedVMs.Count)" -ForegroundColor Green
Write-Host "VMs missing maintenance tags: $($untaggedVMs.Count)" -ForegroundColor Red

if ($untaggedVMs.Count -eq 0) {
    Write-Host "✓ ALL VMs ARE ALREADY TAGGED!" -ForegroundColor Green
    Write-Host "No further action required." -ForegroundColor White
    exit 0
}

Write-Host ""
Write-Host "=== SUBSCRIPTION BREAKDOWN ===" -ForegroundColor Cyan
$subscriptionSummary.GetEnumerator() | Sort-Object Name | ForEach-Object {
    $sub = $_.Value
    if ($sub.Error) {
        Write-Host "$($_.Key): ERROR - $($sub.Error)" -ForegroundColor Red
    } else {
        $percentage = if ($sub.Total -gt 0) { [math]::Round(($sub.Tagged / $sub.Total) * 100, 1) } else { 0 }
        Write-Host "$($_.Key): $($sub.Tagged)/$($sub.Total) tagged ($percentage%)" -ForegroundColor White
    }
}

Write-Host ""
Write-Host "=== UNTAGGED VMs DETAILED LIST ===" -ForegroundColor Red
Write-Host "The following $($untaggedVMs.Count) VMs are missing Azure Update Manager maintenance tags:" -ForegroundColor White

# Group untagged VMs by subscription for easier reading
$untaggedBySubscription = $untaggedVMs | Group-Object Subscription

foreach ($group in $untaggedBySubscription | Sort-Object Name) {
    Write-Host "`nSubscription: $($group.Name) ($($group.Count) untagged VMs)" -ForegroundColor Magenta

    $group.Group | Sort-Object Name | ForEach-Object {
        Write-Host "  • $($_.Name)" -ForegroundColor Yellow
        Write-Host "    Resource Group: $($_.ResourceGroup)" -ForegroundColor Gray
        Write-Host "    Location: $($_.Location)" -ForegroundColor Gray
        Write-Host "    OS Type: $($_.OsType)" -ForegroundColor Gray
        Write-Host "    VM Size: $($_.VmSize)" -ForegroundColor Gray
        Write-Host "    Power State: $($_.PowerState)" -ForegroundColor Gray
        if ($_.Tags -ne "No tags") {
            Write-Host "    Existing Tags: $($_.Tags)" -ForegroundColor DarkGray
        }
        Write-Host ""
    }
}

Write-Host "=== ANALYSIS BY VM CHARACTERISTICS ===" -ForegroundColor Cyan

# Analyze by OS Type
$untaggedByOS = $untaggedVMs | Group-Object OsType
Write-Host "`nUntagged VMs by OS Type:" -ForegroundColor White
$untaggedByOS | Sort-Object Name | ForEach-Object {
    Write-Host "  $($_.Name): $($_.Count) VMs" -ForegroundColor White
}

# Analyze by Location
$untaggedByLocation = $untaggedVMs | Group-Object Location
Write-Host "`nUntagged VMs by Location:" -ForegroundColor White
$untaggedByLocation | Sort-Object Count -Descending | ForEach-Object {
    Write-Host "  $($_.Name): $($_.Count) VMs" -ForegroundColor White
}

# Analyze by VM Size (to understand workload types)
$untaggedBySize = $untaggedVMs | Group-Object VmSize
Write-Host "`nUntagged VMs by Size:" -ForegroundColor White
$untaggedBySize | Sort-Object Count -Descending | Select-Object -First 10 | ForEach-Object {
    Write-Host "  $($_.Name): $($_.Count) VMs" -ForegroundColor White
}

# Analyze by Resource Group (might indicate application/workload groupings)
$untaggedByRG = $untaggedVMs | Group-Object ResourceGroup
Write-Host "`nUntagged VMs by Resource Group (Top 10):" -ForegroundColor White
$untaggedByRG | Sort-Object Count -Descending | Select-Object -First 10 | ForEach-Object {
    Write-Host "  $($_.Name): $($_.Count) VMs" -ForegroundColor White
}

Write-Host ""
Write-Host "=== POWER STATE ANALYSIS ===" -ForegroundColor Cyan
$powerStates = $untaggedVMs | Group-Object PowerState
$powerStates | Sort-Object Count -Descending | ForEach-Object {
    Write-Host "$($_.Name): $($_.Count) VMs" -ForegroundColor White
}

Write-Host ""
Write-Host "=== EXPORT OPTIONS ===" -ForegroundColor Cyan
Write-Host "You can export this data for further analysis:" -ForegroundColor White

# Export to CSV option
$timestamp = Get-Date -Format "yyyyMMdd-HHmm"
$csvPath = "D:\UntaggedVMs-$timestamp.csv"

try {
    $untaggedVMs | Export-Csv -Path $csvPath -NoTypeInformation
    Write-Host "✓ Exported untagged VMs to: $csvPath" -ForegroundColor Green
} catch {
    Write-Host "✗ Failed to export CSV: $($_.Exception.Message)" -ForegroundColor Red
}

# Show simple list for easy copying
Write-Host ""
Write-Host "=== SIMPLE VM NAME LIST (for copy/paste) ===" -ForegroundColor Cyan
Write-Host "VM Names:" -ForegroundColor White
$untaggedVMs | Sort-Object Name | ForEach-Object { Write-Host "  $($_.Name)" -ForegroundColor Yellow }

Write-Host ""
Write-Host "=== NEXT STEPS RECOMMENDATIONS ===" -ForegroundColor Cyan
Write-Host "1. Review the untagged VMs list above" -ForegroundColor White
Write-Host "2. Investigate why these VMs were not in the original patching schedule" -ForegroundColor White
Write-Host "3. Determine appropriate maintenance windows for these VMs" -ForegroundColor White
Write-Host "4. Consider grouping by:" -ForegroundColor White
Write-Host "   • Application/workload (Resource Group analysis)" -ForegroundColor Gray
Write-Host "   • Environment (naming patterns, tags)" -ForegroundColor Gray
Write-Host "   • Business criticality" -ForegroundColor Gray
Write-Host "   • Maintenance window preferences" -ForegroundColor Gray
Write-Host "5. Run the tagging script to assign maintenance windows" -ForegroundColor White

Write-Host ""
Write-Host "=== AZURE RESOURCE GRAPH QUERY ===" -ForegroundColor Cyan
Write-Host "Use this query in Azure Resource Graph Explorer to verify results:" -ForegroundColor White
Write-Host ""
Write-Host @"
Resources
| where type == "microsoft.compute/virtualmachines"
| where tags.PatchWindow == "" or isempty(tags.PatchWindow) or isnull(tags.PatchWindow)
| project name, resourceGroup, subscriptionId, location, 
          osType = properties.storageProfile.osDisk.osType,
          vmSize = properties.hardwareProfile.vmSize,
          powerState = properties.extended.instanceView.powerState.displayStatus,
          tags
| sort by name asc
"@ -ForegroundColor Gray

Write-Host ""
Write-Host "Script completed at $(Get-Date)" -ForegroundColor Cyan
Write-Host "Total runtime: $((Get-Date) - $scriptStart)" -ForegroundColor Gray

Discovery results:

  • 35 VMs from the original MSP schedule (our planned list)
  • 12 additional VMs not in the MSP schedule (the "stragglers")
  • Total: 90 VMs needing Update Manager tags

Key insight: The MSP wasn't managing everything. Several dev/test VMs and a few production systems were missing from their schedule.


✍️ Step 6 – Bulk Tag All VMs with Patch Windows

Now for the main event: tagging all VMs with their maintenance windows. This includes both our planned VMs and the newly discovered ones.

🎯 Main VM Tagging (Planned Schedule)

Each tag serves a specific purpose:

  • PatchWindow — The key tag used by dynamic scopes to assign VMs to maintenance configurations
  • Owner — For accountability and filtering
  • Updates — Identifies VMs managed by Azure Update Manager
Click to expand: Multi-Subscription Azure Update Manager VM Tagging (Sample Script)
# Multi-Subscription Azure Update Manager VM Tagging Script
# This script discovers VMs across multiple subscriptions and tags them appropriately

Write-Host "=== Multi-Subscription Azure Update Manager - VM Tagging Script ===" -ForegroundColor Cyan

# Function to safely tag a VM
function Set-VMMaintenanceTags {
    param(
        [string]$VMName,
        [string]$ResourceGroupName,
        [string]$SubscriptionId,
        [hashtable]$Tags,
        [string]$MaintenanceWindow
    )

    try {
        # Set context to the VM's subscription
        $null = Set-AzContext -SubscriptionId $SubscriptionId -ErrorAction Stop

        Write-Host "  Processing: $VMName..." -ForegroundColor Yellow

        # Get the VM and update tags
        $vm = Get-AzVM -ResourceGroupName $ResourceGroupName -Name $VMName -ErrorAction Stop

        if ($vm.Tags) {
            $Tags.Keys | ForEach-Object { $vm.Tags[$_] = $Tags[$_] }
        } else {
            $vm.Tags = $Tags
        }

        $null = Update-AzVM -VM $vm -ResourceGroupName $ResourceGroupName -Tag $vm.Tags -ErrorAction Stop
        Write-Host "  ✓ Successfully tagged $VMName for $MaintenanceWindow maintenance" -ForegroundColor Green

        return $true
    }
    catch {
        Write-Host "  ✗ Failed to tag $VMName`: $($_.Exception.Message)" -ForegroundColor Red
        return $false
    }
}

# Define all target VMs organized by maintenance window
$maintenanceGroups = @{
    "Monday" = @{
        "VMs" = @("WEB-PROD-01", "DB-PROD-01", "APP-PROD-01", "FILE-PROD-01", "DC-PROD-01")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "mon"
        }
    }
    "Tuesday" = @{
        "VMs" = @("WEB-PROD-02", "DB-PROD-02", "APP-PROD-02", "FILE-PROD-02", "DC-PROD-02")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "tue"
        }
    }
    "Wednesday" = @{
        "VMs" = @("WEB-PROD-03", "DB-PROD-03", "APP-PROD-03", "FILE-PROD-03", "DC-PROD-03")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "wed"
        }
    }
    "Thursday" = @{
        "VMs" = @("WEB-PROD-04", "DB-PROD-04", "APP-PROD-04", "FILE-PROD-04", "PRINT-PROD-01")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "thu"
        }
    }
    "Friday" = @{
        "VMs" = @("WEB-PROD-05", "DB-PROD-05", "APP-PROD-05", "FILE-PROD-05", "MONITOR-PROD-01")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "fri"
        }
    }
    "Saturday" = @{
        "VMs" = @("WEB-DEV-01", "DB-DEV-01", "APP-DEV-01", "TEST-SERVER-01", "SANDBOX-01")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "sat-09"
        }
    }
    "Sunday" = @{
        "VMs" = @("WEB-UAT-01", "DB-UAT-01", "APP-UAT-01", "BACKUP-PROD-01", "MGMT-PROD-01")
        "Tags" = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
            "PatchWindow" = "sun"
        }
    }
}

# Function to discover VMs across all subscriptions
function Find-VMsAcrossSubscriptions {
    param([array]$TargetVMNames)

    $subscriptions = Get-AzSubscription | Where-Object { $_.State -eq "Enabled" }
    $vmInventory = @{}

    foreach ($subscription in $subscriptions) {
        try {
            $null = Set-AzContext -SubscriptionId $subscription.Id -ErrorAction Stop
            $vms = Get-AzVM -ErrorAction Continue

            foreach ($vm in $vms) {
                if ($vm.Name -in $TargetVMNames) {
                    $vmInventory[$vm.Name] = @{
                        Name = $vm.Name
                        ResourceGroupName = $vm.ResourceGroupName
                        SubscriptionId = $subscription.Id
                        SubscriptionName = $subscription.Name
                        Location = $vm.Location
                    }
                }
            }
        }
        catch {
            Write-Host "Error scanning subscription $($subscription.Name): $($_.Exception.Message)" -ForegroundColor Red
        }
    }

    return $vmInventory
}

# Get all unique VM names and discover their locations
$allTargetVMs = @()
$maintenanceGroups.Values | ForEach-Object { $allTargetVMs += $_.VMs }
$allTargetVMs = $allTargetVMs | Sort-Object -Unique

Write-Host "Discovering locations for $($allTargetVMs.Count) target VMs..." -ForegroundColor White
$vmInventory = Find-VMsAcrossSubscriptions -TargetVMNames $allTargetVMs

# Process each maintenance window
$totalSuccess = 0
$totalFailed = 0

foreach ($windowName in $maintenanceGroups.Keys) {
    $group = $maintenanceGroups[$windowName]
    Write-Host "`n=== $windowName MAINTENANCE WINDOW ===" -ForegroundColor Magenta

    foreach ($vmName in $group.VMs) {
        if ($vmInventory.ContainsKey($vmName)) {
            $vmInfo = $vmInventory[$vmName]
            $result = Set-VMMaintenanceTags -VMName $vmInfo.Name -ResourceGroupName $vmInfo.ResourceGroupName -SubscriptionId $vmInfo.SubscriptionId -Tags $group.Tags -MaintenanceWindow $windowName
            if ($result) { $totalSuccess++ } else { $totalFailed++ }
        } else {
            Write-Host "  ⚠️ VM not found: $vmName" -ForegroundColor Yellow
            $totalFailed++
        }
    }
}

Write-Host "`n=== TAGGING SUMMARY ===" -ForegroundColor Cyan
Write-Host "Successfully tagged: $totalSuccess VMs" -ForegroundColor Green
Write-Host "Failed to tag: $totalFailed VMs" -ForegroundColor Red

🧹 Handle the Stragglers

For the 12 VMs not in the original MSP schedule, I used intelligent assignment based on their function:

Click to expand: Tagging Script for Remaining Untagged VMs (Sample Script)
# Intelligent VM Tagging Script for Remaining Untagged VMs
# This script analyzes and tags the remaining VMs based on workload patterns and load balancing

$scriptStart = Get-Date

Write-Host "=== Intelligent VM Tagging for Remaining VMs ===" -ForegroundColor Cyan
Write-Host "Analyzing and tagging 26 untagged VMs with optimal maintenance window distribution..." -ForegroundColor White
Write-Host ""

# Function to safely tag a VM across subscriptions
function Set-VMMaintenanceTags {
    param(
        [string]$VMName,
        [string]$ResourceGroupName,
        [string]$SubscriptionId,
        [hashtable]$Tags,
        [string]$MaintenanceWindow
    )

    try {
        # Set context to the VM's subscription
        $currentContext = Get-AzContext
        if ($currentContext.Subscription.Id -ne $SubscriptionId) {
            $null = Set-AzContext -SubscriptionId $SubscriptionId -ErrorAction Stop
        }

        Write-Host "  Processing: $VMName..." -ForegroundColor Yellow

        # Get the VM
        $vm = Get-AzVM -ResourceGroupName $ResourceGroupName -Name $VMName -ErrorAction Stop

        # Add maintenance tags to existing tags (preserve existing tags)
        if ($vm.Tags) {
            $Tags.Keys | ForEach-Object {
                $vm.Tags[$_] = $Tags[$_]
            }
        } else {
            $vm.Tags = $Tags
        }

        # Update the VM tags
        $null = Update-AzVM -VM $vm -ResourceGroupName $ResourceGroupName -Tag $vm.Tags -ErrorAction Stop
        Write-Host "  ✓ Successfully tagged $VMName for $MaintenanceWindow maintenance" -ForegroundColor Green

        return $true
    }
    catch {
        Write-Host "  ✗ Failed to tag $VMName`: $($_.Exception.Message)" -ForegroundColor Red
        return $false
    }
}

# Define current maintenance window loads (after existing 59 VMs)
$currentLoad = @{
    "Monday" = 7
    "Tuesday" = 7 
    "Wednesday" = 10
    "Thursday" = 6
    "Friday" = 6
    "Saturday" = 17  # Dev/Test at 09:00
    "Sunday" = 6
}

Write-Host "=== CURRENT MAINTENANCE WINDOW LOAD ===" -ForegroundColor Cyan
$currentLoad.GetEnumerator() | Sort-Object Name | ForEach-Object {
    Write-Host "$($_.Key): $($_.Value) VMs" -ForegroundColor White
}

# Initialize counters for new assignments
$newAssignments = @{
    "Monday" = 0
    "Tuesday" = 0
    "Wednesday" = 0
    "Thursday" = 0
    "Friday" = 0
    "Saturday" = 0  # Will use sat-09 for dev/test
    "Sunday" = 0
}

Write-Host ""
Write-Host "=== INTELLIGENT VM GROUPING AND ASSIGNMENT ===" -ForegroundColor Cyan

# Define VM groups with intelligent maintenance window assignments
$vmGroups = @{

    # CRITICAL PRODUCTION SYSTEMS - Spread across different days
    "Critical Infrastructure" = @{
        "VMs" = @(
            @{ Name = "DC-PROD-01"; RG = "rg-infrastructure"; Sub = "Production"; Window = "Sunday"; Reason = "Domain Controller - critical infrastructure" },
            @{ Name = "DC-PROD-02"; RG = "rg-infrastructure"; Sub = "Production"; Window = "Monday"; Reason = "Domain Controller - spread from other DCs" },
            @{ Name = "BACKUP-PROD-01"; RG = "rg-backup"; Sub = "Production"; Window = "Tuesday"; Reason = "Backup Server - spread across week" }
        )
    }

    # PRODUCTION BUSINESS APPLICATIONS - Spread for business continuity
    "Production Applications" = @{
        "VMs" = @(
            @{ Name = "WEB-PROD-01"; RG = "rg-web-production"; Sub = "Production"; Window = "Monday"; Reason = "Web Server - Monday for week start" },
            @{ Name = "DB-PROD-01"; RG = "rg-database-production"; Sub = "Production"; Window = "Tuesday"; Reason = "Database Server - Tuesday" },
            @{ Name = "APP-PROD-01"; RG = "rg-app-production"; Sub = "Production"; Window = "Wednesday"; Reason = "Application Server - mid-week" }
        )
    }

    # DEV/TEST SYSTEMS - Saturday morning maintenance (like existing dev/test)
    "Development Systems" = @{
        "VMs" = @(
            @{ Name = "WEB-DEV-01"; RG = "rg-web-development"; Sub = "Development"; Window = "Saturday"; Reason = "Web Dev - join existing dev/test window" },
            @{ Name = "DB-DEV-01"; RG = "rg-database-development"; Sub = "Development"; Window = "Saturday"; Reason = "Database Dev - join existing dev/test window" },
            @{ Name = "TEST-SERVER-01"; RG = "rg-testing"; Sub = "Development"; Window = "Saturday"; Reason = "Test Server - join existing dev/test window" }
            # ... additional dev/test VMs
        )
    }
}

# Initialize counters
$totalProcessed = 0
$totalSuccess = 0
$totalFailed = 0

# Process each group
foreach ($groupName in $vmGroups.Keys) {
    $group = $vmGroups[$groupName]
    Write-Host "`n=== $groupName ===" -ForegroundColor Magenta
    Write-Host "Processing $($group.VMs.Count) VMs in this group" -ForegroundColor White

    foreach ($vmInfo in $group.VMs) {
        $window = $vmInfo.Window
        $vmName = $vmInfo.Name

        Write-Host "`n�️ $vmName → $window maintenance window" -ForegroundColor Yellow
        Write-Host "   Reason: $($vmInfo.Reason)" -ForegroundColor Gray

        # Determine subscription ID from name
        $subscriptionId = switch ($vmInfo.Sub) {
            "Production" { (Get-AzSubscription -SubscriptionName "Production").Id }
            "DevTest" { (Get-AzSubscription -SubscriptionName "DevTest").Id }
            "Identity" { (Get-AzSubscription -SubscriptionName "Identity").Id }
            "DMZ" { (Get-AzSubscription -SubscriptionName "DMZ").Id }
        }

        # Create appropriate tags based on maintenance window
        $tags = @{
            "Owner" = "Contoso"
            "Updates" = "Azure Update Manager"
        }

        if ($window -eq "Saturday") {
            $tags["PatchWindow"] = "sat-09"  # Saturday 09:00 for dev/test
        } else {
            $tags["PatchWindow"] = $window.ToLower().Substring(0,3)  # mon, tue, wed, etc.
        }

        $result = Set-VMMaintenanceTags -VMName $vmInfo.Name -ResourceGroupName $vmInfo.RG -SubscriptionId $subscriptionId -Tags $tags -MaintenanceWindow $window

        $totalProcessed++
        if ($result) { 
            $totalSuccess++
            $newAssignments[$window]++
        } else { 
            $totalFailed++ 
        }
    }
}

Write-Host ""
Write-Host "=== TAGGING SUMMARY ===" -ForegroundColor Cyan
Write-Host "Total VMs processed: $totalProcessed" -ForegroundColor White
Write-Host "Successfully tagged: $totalSuccess" -ForegroundColor Green
Write-Host "Failed to tag: $totalFailed" -ForegroundColor Red

Write-Host ""
Write-Host "=== NEW MAINTENANCE WINDOW DISTRIBUTION ===" -ForegroundColor Cyan
Write-Host "VMs added to each maintenance window:" -ForegroundColor White

$newAssignments.GetEnumerator() | Sort-Object Name | ForEach-Object {
    if ($_.Value -gt 0) {
        $newTotal = $currentLoad[$_.Key] + $_.Value
        Write-Host "$($_.Key): +$($_.Value) VMs (total: $newTotal VMs)" -ForegroundColor Green
    }
}

Write-Host ""
Write-Host "=== FINAL MAINTENANCE WINDOW LOAD ===" -ForegroundColor Cyan
$finalLoad = @{}
$currentLoad.Keys | ForEach-Object {
    $finalLoad[$_] = $currentLoad[$_] + $newAssignments[$_]
}

$finalLoad.GetEnumerator() | Sort-Object Name | ForEach-Object {
    $status = if ($_.Value -le 8) { "Green" } elseif ($_.Value -le 12) { "Yellow" } else { "Red" }
    Write-Host "$($_.Key): $($_.Value) VMs" -ForegroundColor $status
}

$grandTotal = ($finalLoad.Values | Measure-Object -Sum).Sum
Write-Host "`nGrand Total: $grandTotal VMs across all maintenance windows" -ForegroundColor White

Write-Host ""
Write-Host "=== BUSINESS LOGIC APPLIED ===" -ForegroundColor Cyan
Write-Host "✅ Critical systems spread across different days for resilience" -ForegroundColor Green
Write-Host "✅ Domain Controllers distributed to avoid single points of failure" -ForegroundColor Green
Write-Host "✅ Dev/Test systems consolidated to Saturday morning (existing pattern)" -ForegroundColor Green
Write-Host "✅ Production workstations spread to minimize user impact" -ForegroundColor Green
Write-Host "✅ Business applications distributed for operational continuity" -ForegroundColor Green
Write-Host "✅ Load balancing maintained across the week" -ForegroundColor Green

Write-Host ""
Write-Host "=== VERIFICATION STEPS ===" -ForegroundColor Cyan
Write-Host "1. Verify tags in Azure Portal across all subscriptions" -ForegroundColor White
Write-Host "2. Check that critical systems are on different days" -ForegroundColor White
Write-Host "3. Confirm dev/test systems are in Saturday morning window" -ForegroundColor White
Write-Host "4. Review production systems distribution" -ForegroundColor White

Write-Host ""
Write-Host "=== AZURE RESOURCE GRAPH VERIFICATION QUERY ===" -ForegroundColor Cyan
Write-Host "Use this query to verify all VMs are now tagged:" -ForegroundColor White
Write-Host ""
Write-Host @"
Resources
| where type == "microsoft.compute/virtualmachines"
| where tags.Updates == "Azure Update Manager"
| project name, resourceGroup, subscriptionId, 
          patchWindow = tags.PatchWindow,
          owner = tags.Owner,
          updates = tags.Updates
| sort by patchWindow, name
| summarize count() by patchWindow
"@ -ForegroundColor Gray

if ($totalFailed -eq 0) {
    Write-Host ""
    Write-Host "� ALL VMs SUCCESSFULLY TAGGED WITH INTELLIGENT DISTRIBUTION! �" -ForegroundColor Green
} else {
    Write-Host ""
    Write-Host "⚠️ Some VMs failed to tag. Please review errors above." -ForegroundColor Yellow
}

Write-Host ""
Write-Host "Script completed at $(Get-Date)" -ForegroundColor Cyan
Write-Host "Total runtime: $((Get-Date) - $scriptStart)" -ForegroundColor Gray

Key insight: I grouped VMs by function and criticality, not just by convenience. Domain controllers got spread across different days, dev/test systems joined the existing Saturday morning window, and production applications were distributed for business continuity.


🧰 Step 7 – Configure Azure Policy Prerequisites

Here's where things get interesting. Update Manager is built on compliance — but your VMs won't show up in dynamic scopes unless they meet certain prerequisites. Enter Azure Policy to save the day.

You'll need two specific built-in policies assigned at the subscription (or management group) level:

✅ Policy 1: Set prerequisites for scheduling recurring updates on Azure virtual machines

What it does: This policy ensures your VMs have the necessary configurations to participate in Azure Update Manager. It automatically:

  • Installs the Azure Update Manager extension on Windows VMs
  • Registers required resource providers
  • Configures the VM to report its update compliance status
  • Sets the patch orchestration mode appropriately

Why this matters: Without this policy, VMs won't appear in Update Manager scopes even if they're tagged correctly. The policy handles all the "plumbing" automatically.

Assignment scope: Apply this at subscription or management group level to catch all VMs.

✅ Policy 2: Configure periodic checking for missing system updates on Azure virtual machines

What it does: This is your compliance engine. It configures VMs to:

  • Regularly scan for available updates (but not install them automatically)
  • Report update status back to Azure Update Manager
  • Enable the compliance dashboard views in the portal
  • Provide the data needed for maintenance configuration targeting

Why this matters: This policy turns on the "update awareness" for your VMs. Without it, Azure Update Manager has no visibility into what patches are needed.

Assignment scope: Same as above — subscription or management group level.

🎯 Assigning the Policies

Step-by-step in Azure Portal:

  1. Navigate to Azure Policy
     • Azure Portal → Search "Policy" → Select "Policy"

  2. Find the First Policy
     • Left menu: Definitions
     • Search: Set prerequisites for scheduling recurring updates
     • Click on the policy title

  3. Assign the Policy
     • Click the Assign button
     • Scope: Select your subscription(s)
     • Basics: Leave the policy name as default
     • Parameters: Leave as default
     • Remediation: ✅ Check "Create remediation task"
     • Review + create

  4. Repeat for the Second Policy
     • Search: Configure periodic checking for missing system updates
     • Follow the same assignment process
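If you'd rather script the assignments than click through the portal, here's a minimal PowerShell sketch using Az.Resources and Az.PolicyInsights. Treat it as a starting point rather than a definitive implementation: parameter names vary slightly between module versions, and the assignment's managed identity still needs the role(s) the policy's remediation requires (something the portal wires up for you automatically).

# Sketch only - policy display name is from this post; scope and location are placeholders
$scope = "/subscriptions/<subscription-id>"
$displayName = "Set prerequisites for scheduling recurring updates on Azure virtual machines"

# Locate the built-in policy definition by display name
# (newer Az.Resources exposes DisplayName at the top level, older versions under .Properties)
$prereqDef = Get-AzPolicyDefinition -Builtin | Where-Object {
    $_.DisplayName -eq $displayName -or $_.Properties.DisplayName -eq $displayName
}

# Assign at subscription scope with a system-assigned identity so the remediation task can deploy changes
# (older Az.Resources versions use -AssignIdentity instead of -IdentityType)
$assignment = New-AzPolicyAssignment -Name "aum-prereqs" `
    -Scope $scope `
    -PolicyDefinition $prereqDef `
    -Location "uksouth" `
    -IdentityType "SystemAssigned"

# Remediate existing non-compliant VMs (Az.PolicyInsights)
$assignmentId = if ($assignment.PolicyAssignmentId) { $assignment.PolicyAssignmentId } else { $assignment.Id }
Start-AzPolicyRemediation -Name "aum-prereqs-remediation" `
    -PolicyAssignmentId $assignmentId `
    -Scope $scope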

⚠️ Important: Policy compliance can take 30+ minutes to evaluate and apply. Perfect time for that brew I mentioned earlier.

🔍 Monitoring Compliance

Once assigned, you can track compliance in Azure Policy > Compliance. Look for:

  • Non-compliant VMs that need the extension installed
  • VMs that aren't reporting update status properly
  • Any policy assignment errors that need investigation

Learn more about Azure Policy for Update Management


🧪 Step 8 – Create Dynamic Scopes in Update Manager

This is where it all comes together — and where the magic happens.

Dynamic scopes use those PatchWindow tags to assign VMs to the correct patch config automatically. No more manual VM assignment, no more "did we remember to add the new server?" conversations.

🎯 The Portal Dance

Unfortunately, as of writing, dynamic scopes can only be configured through the Azure portal — no PowerShell or ARM template support yet.

Why portal only? Dynamic scopes are still in preview, and Microsoft hasn't released the PowerShell cmdlets or ARM template schemas yet. This means you can't fully automate the deployment, but the functionality itself works perfectly.

Here's the step-by-step:

  1. Navigate to Azure Update Manager
     • Portal → All Services → Azure Update Manager

  2. Access Maintenance Configurations
     • Go to Maintenance Configurations (Preview)
     • Select one of your configs (e.g., contoso-maintenance-config-vms-mon)

  3. Create a Dynamic Scope
     • Click Dynamic Scopes → Add
     • Name: DynamicScope-Monday-VMs
     • Description: Auto-assign Windows VMs tagged for Monday maintenance

  4. Configure Scope Settings
     • Subscription: Select your subscription(s)
     • Resource Type: Microsoft.Compute/virtualMachines
     • OS Type: Windows (create separate scopes for Linux if needed)

  5. Set Tag Filters
     • Tag Name: PatchWindow
     • Tag Value: mon (must match your maintenance config naming)
     • Additional filters (optional): Owner = Contoso, Updates = Azure Update Manager

  6. Review and Create
     • Verify the filter logic
     • Click Create

🔄 Repeat for All Days

You'll need to create dynamic scopes for each maintenance configuration:

Maintenance config → dynamic scope name (tag filter):

  • contoso-maintenance-config-vms-mon → DynamicScope-Monday-VMs (PatchWindow = mon)
  • contoso-maintenance-config-vms-tue → DynamicScope-Tuesday-VMs (PatchWindow = tue)
  • contoso-maintenance-config-vms-wed → DynamicScope-Wednesday-VMs (PatchWindow = wed)
  • contoso-maintenance-config-vms-thu → DynamicScope-Thursday-VMs (PatchWindow = thu)
  • contoso-maintenance-config-vms-fri → DynamicScope-Friday-VMs (PatchWindow = fri)
  • contoso-maintenance-config-vms-sat → DynamicScope-Saturday-VMs (PatchWindow = sat-09)
  • contoso-maintenance-config-vms-sun → DynamicScope-Sunday-VMs (PatchWindow = sun)

🔍 Verify Dynamic Scope Assignment

Once created, you can verify the scopes are working:

  1. In the Maintenance Configuration:
     • Go to Dynamic Scopes
     • Check the Resources tab to see matched VMs
     • Verify the expected VM count matches your tagging
     • Wait time: Allow 15-30 minutes for newly tagged VMs to appear

  2. What success looks like:
     • Monday scope shows 5 VMs (WEB-PROD-01, DB-PROD-01, etc.)
     • Saturday scope shows 5 VMs (WEB-DEV-01, DB-DEV-01, etc.)
     • No VMs showing? Check tag case sensitivity and filters

  3. In Azure Resource Graph:

MaintenanceResources
| where type == "microsoft.maintenance/configurationassignments"
| extend vmName = tostring(split(resourceId, "/")[8])
| extend configName = tostring(properties.maintenanceConfigurationId)
| project vmName, configName, resourceGroup
| order by configName, vmName
  4. Troubleshoot empty scopes:
     • Verify the subscription selection includes all your VMs
     • Check tag spelling: PatchWindow (case sensitive)
     • Confirm the resource type filter: Microsoft.Compute/virtualMachines
     • Wait longer - it can take up to 30 minutes
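The same check can be run from PowerShell instead of the portal blade. A minimal sketch, assuming the Az.ResourceGraph module is installed (Install-Module Az.ResourceGraph):

$query = @"
MaintenanceResources
| where type == "microsoft.maintenance/configurationassignments"
| extend configName = tostring(properties.maintenanceConfigurationId)
| summarize VMCount = count() by configName
| order by configName asc
"@

# Returns one row per maintenance configuration with the number of assigned VMs
Search-AzGraph -Query $query | Format-Table configName, VMCount -AutoSize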

⚠️ Common Gotchas

Tag Case Sensitivity: Dynamic scopes are case-sensitive. mon ≠ Mon ≠ MON

Subscription Scope: Ensure you've selected all relevant subscriptions in the scope configuration.

Resource Type Filter: Don't forget to set the resource type filter — without it, you'll match storage accounts, networking, etc.

Timing: It can take 15-30 minutes for newly tagged VMs to appear in dynamic scopes.

Dynamic scope configuration docs


🚀 Step 9 – Test & Verify (The Moment of Truth)

The acid test: does it actually patch stuff properly?

🎪 Proof of Concept Test

I started conservatively — scoped contoso-maintenance-config-vms-sun to a few non-critical VMs and let it run overnight on Sunday.

Monday morning verification:

  • ✔️ Patch compliance dashboard: All green ticks
  • ✔️ Reboot timing: Machines restarted within their 4-hour window (21:00-01:00)
  • ✔️ Update logs: Activity logs showed expected patching behavior
  • ✔️ Business impact: Zero helpdesk tickets on Monday morning

📊 Full Rollout Verification

Once confident with the Sunday test, I enabled all remaining dynamic scopes and monitored the week:

Key metrics tracked:

  • Patch compliance percentage across all VMs
  • Failed patch installations (and root causes)
  • Reboot timing adherence
  • Business hours impact (spoiler: zero)

🔍 Monitoring & Validation Tools

Azure Update Manager Dashboard:

Azure Portal → Update Manager → Overview
- Patch compliance summary
- Recent patch installations
- Failed installations with details

Azure Resource Graph Queries:

// Verify all VMs have maintenance tags
Resources
| where type == "microsoft.compute/virtualmachines"
| where tags.Updates == "Azure Update Manager"
| project name, resourceGroup, subscriptionId, 
          patchWindow = tags.PatchWindow,
          owner = tags.Owner
| summarize count() by patchWindow
| order by patchWindow

// Check maintenance configuration assignments
MaintenanceResources
| where type == "microsoft.maintenance/configurationassignments"
| extend vmName = tostring(split(resourceId, "/")[8])
| extend configName = tostring(properties.maintenanceConfigurationId)
| project vmName, configName, subscriptionId
| summarize VMCount = count() by configName
| order by configName

PowerShell Verification:

# Quick check of maintenance configuration status
Get-AzMaintenanceConfiguration -ResourceGroupName "rg-maintenance-uksouth-001" | 
    Select-Object Name, MaintenanceScope, RecurEvery | 
    Format-Table -AutoSize

# Verify VM tag distribution
$subscriptions = Get-AzSubscription | Where-Object { $_.State -eq "Enabled" }
$tagSummary = @{}

foreach ($sub in $subscriptions) {
    Set-AzContext -SubscriptionId $sub.Id | Out-Null
    $vms = Get-AzVM | Where-Object { $_.Tags.PatchWindow }

    foreach ($vm in $vms) {
        $window = $vm.Tags.PatchWindow
        if (-not $tagSummary.ContainsKey($window)) {
            $tagSummary[$window] = 0
        }
        $tagSummary[$window]++
    }
}

Write-Host "=== VM DISTRIBUTION BY PATCH WINDOW ===" -ForegroundColor Cyan
$tagSummary.GetEnumerator() | Sort-Object Name | ForEach-Object {
    Write-Host "$($_.Key): $($_.Value) VMs" -ForegroundColor White
}

📈 Success Metrics

After two full weeks of operation:

  • Better control: Direct management of patch schedules and policies
  • Increased visibility: Real-time compliance dashboards vs. periodic reports
  • Reduced complexity: Native Azure tooling vs. third-party solutions

Monitor updates in Azure Update Manager


📃 Final Thoughts & Tips

  • ✅ Cost-neutral — No more third-party patch agents
  • ✅ Policy-driven — Enforced consistency with Azure Policy
  • ✅ Easily auditable — Tag-based scoping is clean and visible
  • ✅ Scalable — New VMs auto-join patch schedules via tagging

⚠️ Troubleshooting Guide & Common Issues

Here's what I learned the hard way, so you don't have to:

Symptom → Possible cause → Fix:

  • VM not showing in dynamic scope → Tag typo or case mismatch → Verify the PatchWindow tag exactly matches the config name
  • Maintenance config creation fails → Invalid duration format → Use ISO 8601 format: "03:00", not "3 hours"
  • VM skipped during patching → Policy prerequisites not met → Check the Azure Policy compliance dashboard
  • No updates applied despite schedule → VM has a pending reboot → Clear the pending reboot and check the update history
  • Dynamic scope shows zero VMs → Wrong subscription scope → Verify the subscription selection in the scope config
  • Extension installation failed → Insufficient permissions → Ensure VM Contributor rights and resource provider registration
  • Policy compliance stuck at 0% → Assignment scope too narrow → Check the policy is assigned at subscription level
  • VMs appear/disappear from scope → Tag inconsistency → Run the tag verification script across all subscriptions

🔧 Advanced Troubleshooting Commands

Check VM Update Readiness:

# Verify VM has required extensions and configuration
$vmName = "your-vm-name"
$rgName = "your-resource-group"

$vm = Get-AzVM -Name $vmName -ResourceGroupName $rgName -Status
$vm.Extensions | Where-Object { $_.Name -like "*Update*" -or $_.Name -like "*Maintenance*" }

Validate Maintenance Configuration:

# Test maintenance configuration is properly formed
$config = Get-AzMaintenanceConfiguration -ResourceGroupName "rg-maintenance-uksouth-001" -Name "contoso-maintenance-config-vms-mon"
Write-Host "Config Name: $($config.Name)"
Write-Host "Recurrence: $($config.RecurEvery)"
Write-Host "Duration: $($config.Duration)"
Write-Host "Start Time: $($config.StartDateTime)"
Write-Host "Timezone: $($config.TimeZone)"

Policy Compliance Deep Dive:

# Check specific VMs for policy compliance
$policyName = "Set prerequisites for scheduling recurring updates on Azure virtual machines"
$assignments = Get-AzPolicyAssignment | Where-Object { $_.Properties.DisplayName -eq $policyName }
foreach ($assignment in $assignments) {
    Get-AzPolicyState -PolicyAssignmentId $assignment.PolicyAssignmentId | 
        Where-Object { $_.ComplianceState -eq "NonCompliant" } |
        Select-Object ResourceId, ComplianceState, @{Name="Reason";Expression={$_.PolicyEvaluationDetails.EvaluatedExpressions.ExpressionValue}}
}

As always, comments and suggestions welcome over on GitHub or LinkedIn. If you've migrated patching in a different way, I'd love to hear how you approached it.


⚙️ Azure BCDR Review – Turning Inherited Cloud Infrastructure into a Resilient Recovery Strategy

When we inherited our Azure estate from a previous MSP, some of the key technical components were already in place — ASR was configured for a number of workloads, and backups had been partially implemented across the environment.

What we didn’t inherit was a documented or validated BCDR strategy.

There were no formal recovery plans defined in ASR, no clear failover sequences, and no evidence that a regional outage scenario had ever been modelled or tested. The building blocks were there — but there was no framework tying them together into a usable or supportable recovery posture.

This post shares how I approached the challenge of assessing and strengthening our Azure BCDR readiness. It's not about starting from scratch — it's about applying structure, logic, and realism to an environment that had the right intentions but lacked operational clarity.

Whether you're stepping into a similar setup or planning your first formal DR review, I hope this provides a practical and relatable blueprint.


🎯 Where We Started: Technical Foundations, Operational Gaps

We weren’t starting from zero — but we weren’t in a position to confidently recover the environment either.

What we found:

  • 🟢 ASR replication was partially implemented
  • 🟡 VM backups were present but inconsistent
  • ❌ No Recovery Plans existed in ASR
  • ❌ No test failovers had ever been performed
  • ⚠️ No documented RTO/RPO targets
  • ❓ DNS and Private Endpoints weren’t accounted for in DR
  • 🔒 Networking had not been reviewed for failover scenarios
  • 🚫 No capacity reservations had been made

This review was the first step in understanding whether our DR setup could work in practice — not just in theory.


🛡️ 1️⃣ Workload Protection: What’s Covered, What’s Not

Some workloads were actively replicated via ASR. Others were only backed up. Some had both, a few had neither. There was no documented logic to explain why.

Workload protection appeared to be driven by convenience or historical context — not by business impact or recovery priority.

What we needed was a structured tiering model:

  • 🧠 Which workloads are mission-critical?
  • ⏱️ Which ones can tolerate extended recovery times?
  • 📊 What RTOs and RPOs are actually achievable?

This became the foundation for everything else.


🧩 2️⃣ The Missing Operational Layer

We had technical coverage — but no operational recovery strategy.

There were no Recovery Plans in ASR. No sequencing, no post-failover validation, and no scripts or automation.

In the absence of structure, recovery would be entirely manual — relying on individual knowledge, assumptions, and good luck during a critical event.

Codifying dependencies, failover order, and recovery steps became a priority.


🌐 3️⃣ DNS, Identity and Private Endpoint Blind Spots

DNS and authentication are easy to overlook — until they break.

Our name resolution relied on internal DNS via AD-integrated zones, with no failover logic for internal record switching. No private DNS zones were in place.

Private Endpoints were widely used, but all existed in the primary region. In a DR scenario, they would become unreachable.

Identity was regionally redundant, but untested and not AZ-aware.

We needed to promote DNS, identity, and PE routing to first-class DR concerns.


💾 4️⃣ Storage and Data Access Risk

Azure Storage backed a range of services — from SFTP and app data to file shares and diagnostics.

Replication strategies varied (LRS, RA-GRS, ZRS) with no consistent logic or documentation. Critical storage accounts weren’t aligned with workload tiering.

Some workloads used Azure Files and Azure File Sync, but without defined mount procedures or recovery checks.

In short: compute could come back, but data availability wasn’t assured.
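To put numbers behind that finding, a quick audit of replication SKUs across subscriptions helped. A minimal sketch with the Az PowerShell modules, following the same subscription loop used elsewhere in these posts:

# Surface the replication SKU (LRS/GRS/RA-GRS/ZRS) of every storage account across enabled subscriptions
$subscriptions = Get-AzSubscription | Where-Object { $_.State -eq "Enabled" }

$results = foreach ($sub in $subscriptions) {
    Set-AzContext -SubscriptionId $sub.Id | Out-Null
    Get-AzStorageAccount |
        Select-Object StorageAccountName, ResourceGroupName, Location,
                      @{ Name = "Replication"; Expression = { $_.Sku.Name } }
}

$results | Sort-Object Replication, StorageAccountName | Format-Table -AutoSize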


🔌 5️⃣ The Networking Piece (And Why It Matters More Than People Think)

NSGs, UDRs, custom routing, and SD-WAN all played a part in how traffic flowed.

But in DR, assumptions break quickly.

There was no documentation of network flow in the DR region, and no validation of inter-VM or service-to-service reachability post-failover.

Some services — like App Gateways, Internal Load Balancers, and Private Endpoints — were region-bound and would require re-deployment or manual intervention.

Networking wasn’t the background layer — it was core to recoverability.


📦 6️⃣ Capacity Risk: When DR Isn’t Guaranteed

VM replication is only half the story. The other half is whether those VMs can actually start during a DR event.

Azure doesn’t guarantee regional capacity unless you've pre-purchased it.

In our case, no capacity reservations had been made. That meant no assurance that our Tier 0 or Tier 1 workloads could even boot if demand spiked during a region-wide outage.

This is a quiet but critical risk — and one worth addressing early.


✅ Conclusion: From Discovery to Direction

This review wasn’t about proving whether DR was in place — it was about understanding whether it would actually work.

The tooling was present. The protection was partial. The process was missing.

By mapping out what was covered, where the gaps were, and how recovery would actually unfold, we created a baseline that gave us clarity and control.


📘 Coming Next: Documenting the Plan

In the next post, I’ll walk through how I formalised the review into a structured BCDR posture document — including:

  • 🧱 Mapping workloads by tier and impact
  • ⏳ Defining current vs target RTO/RPO
  • 🛠️ Highlighting gaps in automation, DNS, storage, and capacity
  • 🧭 Building a recovery plan roadmap
  • ⚖️ Framing cost vs risk for stakeholder alignment

If you're facing a similar situation — whether you're inheriting someone else's cloud estate or building DR into a growing environment — I hope this series helps bring structure to the complexity.

Let me know if you'd find it useful to share templates or walkthroughs in the next post.



🧾 Azure BCDR – How I Turned a DR Review into a Strategic Recovery Plan

In Part 1 of this series, I shared how we reviewed our Azure BCDR posture after inheriting a partially implemented cloud estate. The findings were clear: while the right tools were in place, the operational side of disaster recovery hadn’t been addressed.

There were no test failovers, no documented Recovery Plans, no automation, and several blind spots in DNS, storage, and private access.

This post outlines how I took that review and turned it into a practical recovery strategy — one that we could share internally, align with our CTO, and use as a foundation for further work with our support partner.

To provide context, our estate is deployed primarily in the UK South Azure region, with UK West serving as the designated DR target region.

It’s not a template — it’s a repeatable, real-world approach to structuring a BCDR plan when you’re starting from inherited infrastructure, not a clean slate.


🧭 1. Why Documenting the Plan Matters

Most cloud teams can identify issues. Fewer take the time to formalise the findings in a way that supports action and alignment.

Documenting our BCDR posture gave us three things:

  • 🧠 Clarity — a shared understanding of what’s protected and what isn’t
  • 🔦 Visibility — a way to surface risk and prioritise fixes
  • 🎯 Direction — a set of realistic, cost-aware next steps

We weren’t trying to solve every problem at once. The goal was to define a usable plan we could act on, iterate, and eventually test — all while making sure that effort was focused on the right areas.


🧱 2. Starting the Document

I structured the document to speak to both technical stakeholders and senior leadership. It needed to balance operational context with strategic risk.

✍️ Core sections included

  • Executive Summary – what the document is, why it matters
  • Maturity Snapshot – a simple traffic-light view of current vs target posture
  • Workload Overview – what’s in scope and what’s protected
  • Recovery Objectives – realistic RPO/RTO targets by tier
  • Gaps and Risks – the areas most likely to cause DR failure
  • Recommendations – prioritised, actionable, and cost-aware
  • Next Steps – what we can handle internally, and what goes to the MSP

Each section followed the same principle: clear, honest, and focused on action. No fluff, no overstatements — just a straightforward review of where we stood and what needed doing.


🧩 3. Defining the Current State

Before we could plan improvements, we had to document what actually existed. This wasn’t about assumptions — it was about capturing the real configuration and coverage in Azure.

🗂️ Workload Inventory

We started by categorising all VMs and services:

  • Domain controllers
  • Application servers (web/API/backend)
  • SQL Managed Instances
  • Infrastructure services (file, render, schedulers)
  • Management and monitoring VMs

Each workload was mapped by criticality and recovery priority — not just by type.

🛡️ Protection Levels

For each workload, we recorded:

  • ✅ Whether it was protected by ASR
  • 🔁 Whether it was backed up only
  • 🚫 Whether it had no protection (with justification)

We also reviewed the geographic layout — e.g. which services were replicated into UK West, and which existed only in UK South.
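For the backup side of that inventory, a short PowerShell sketch gave a per-VM view of whether Azure Backup protection existed (ASR coverage was read from the Recovery Services vault separately). This assumes Az.RecoveryServices is installed and the context is set to the relevant subscription:

# Flag which VMs in the current subscription are protected by Azure Backup
$report = foreach ($vm in Get-AzVM) {
    $backup = Get-AzRecoveryServicesBackupStatus -Name $vm.Name `
        -ResourceGroupName $vm.ResourceGroupName `
        -Type "Microsoft.Compute/virtualMachines"

    [PSCustomObject]@{
        VMName   = $vm.Name
        BackedUp = $backup.BackedUp
        VaultId  = $backup.VaultId
    }
}

$report | Sort-Object BackedUp, VMName | Format-Table -AutoSize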

🧠 Supporting Services

Beyond the VMs, we looked at:

  • Identity services (AD, domain controllers, replication health)
  • DNS architecture (AD-integrated zones, private DNS zones)
  • Private Endpoints and their region-specific availability
  • Storage account replication types (LRS, RA-GRS, ZRS)
  • Network security and routing configurations in DR

The aim wasn’t to build a full asset inventory — just to gather enough visibility to start making risk-based decisions about what mattered, and what was missing.


⏱️ 4. Setting Recovery Objectives

Once the current state was mapped, the next step was to define what “recovery” should actually look like — in terms that could be communicated, challenged, and agreed.

We focused on two key metrics:

  • RTO (Recovery Time Objective): How long can this system be offline before we see significant operational impact?
  • RPO (Recovery Point Objective): How much data loss is acceptable in a worst-case failover?

These weren’t guessed or copied from a template. We worked with realistic assumptions based on our tooling, team capability, and criticality of the services.

📊 Tiered Recovery Model

Each workload was assigned to one of four tiers:

  • Tier 0 — Core infrastructure (identity, DNS, routing)
  • Tier 1 — Mission-critical production workloads
  • Tier 2 — Important, but not time-sensitive services
  • SQL MI — Treated separately due to its PaaS nature

We then applied RTO and RPO targets based on what we could achieve today vs what we aim to reach with improvements.

🔥 Heatmap Example

  • Tier 0 – Identity: RPO 5 min / RTO 60 min (current) → RPO 5 min / RTO 30 min (optimised)
  • Tier 1 – Prod: RPO 5 min / RTO 360 min (current) → RPO 5 min / RTO 60 min (optimised)
  • Tier 2 – Non-Crit: RPO 1440 min / RTO 1440 min (current) → RPO 60 min / RTO 240 min (optimised)
  • SQL MI: RPO 0 min / RTO 60 min (current) → RPO 0 min / RTO 30 min (optimised)

🚧 5. Highlighting Gaps and Risks

With recovery objectives defined, the gaps became much easier to identify — and to prioritise.

We weren’t trying to protect everything equally. The goal was to focus attention on the areas that introduced the highest risk to recovery if left unresolved.

⚠️ What We Flagged

  • ❌ No test failovers had ever been performed
  • ❌ No Recovery Plans existed
  • 🌐 Public-facing infrastructure only existed in one region
  • 🔒 Private Endpoints lacked DR equivalents
  • 🧭 DNS failover was manual or undefined
  • 💾 Storage accounts had inconsistent replication logic
  • 🚫 No capacity reservations existed for critical VM SKUs

Each gap was documented with its impact, priority, and remediation options.


🛠️ 6. Strategic Recommendations

We split our recommendations into what we could handle internally, and what would require input from our MSP or further investment.

📌 Internal Actions

  • Build and test Recovery Plans for Tier 0 and Tier 1 workloads
  • Improve DNS failover scripting
  • Review VM tags to reflect criticality and protection state
  • Create sequencing logic for application groups
  • Align NSGs and UDRs in DR with production

🤝 MSP-Led or Partner Support

  • Duplicate App Gateways / ILBs in UK West
  • Implement Private DNS Zones
  • Review and implement capacity reservations
  • Test runbook-driven recovery automation
  • Conduct structured test failovers across service groups

📅 7. Making It Actionable

A plan needs ownership and timelines. We assigned tasks by role and defined short-, medium-, and long-term priorities using a simple planning table.

We treat the BCDR document as a living artefact — updated quarterly, tied to change control, and used to guide internal work and partner collaboration.


🔚 8. Closing Reflections

The original goal wasn’t to build a perfect DR solution — it was to understand where we stood, make recovery realistic, and document a plan that would hold up when we needed it most.

We inherited a functional technical foundation — but needed to formalise and validate it as part of a resilient DR posture.

By documenting the estate, defining recovery objectives, and identifying where the real risks were, we turned a passive DR posture into something we could act on. We gave stakeholders clarity. We gave the support partner direction. And we gave ourselves a roadmap.


🔜 What’s Next

In the next part of this series, I’ll walk through how we executed the plan:

  • Building and testing our first Recovery Plan
  • Improving ASR coverage and validation
  • Running our first failover drill
  • Reviewing results and updating the heatmap

If you're stepping into an inherited cloud environment or starting your first structured DR review, I hope this gives you a practical view of what’s involved — and what’s achievable without overcomplicating the process.

Let me know if you'd like to see templates or report structures from this process in a future post.



💰 Saving Azure Costs with Scheduled VM Start/Stop using Custom Azure Automation Runbooks

As part of my ongoing commitment to FinOps practices, I've implemented several strategies to embed cost-efficiency into the way we manage cloud infrastructure. One proven tactic is scheduling virtual machines to shut down during idle periods, avoiding unnecessary spend.

In this post, I’ll share how I’ve built out custom Azure Automation jobs to schedule VM start and stop operations. Rather than relying on Microsoft’s pre-packaged solution, I’ve developed a streamlined, purpose-built PowerShell implementation that provides maximum flexibility, transparency, and control.


✍️ Why I Chose Custom Runbooks Over the Prebuilt Solution

Microsoft provides a ready-made “Start/Stop VMs during off-hours” solution via the Automation gallery. While functional, it’s:

  • A bit over-engineered for simple needs,
  • Relatively opaque under the hood, and
  • Not ideal for environments where control and transparency are priorities.

My custom jobs:

  • Use native PowerShell modules within Azure Automation,
  • Are scoped to exactly the VMs I want via tags,
  • Provide clean logging and alerting, and
  • Keep things simple, predictable, and auditable.

🛠️ Step 1: Set Up the Azure Automation Account

🔗 Official docs: Create and manage an Azure Automation Account

  1. Go to the Azure Portal and search for Automation Accounts.
  2. Click + Create.
  3. Fill out the basics:
     • Name: e.g. vm-scheduler
     • Resource Group: Create new or select existing
     • Region: Preferably where your VMs are located
  4. Enable System-Assigned Managed Identity.
  5. Once created, go to the Automation Account and ensure the following modules are imported using the Modules blade in the Azure Portal:
     • Az.Accounts
     • Az.Compute

✅ Tip: These modules can be added from the gallery in just a few clicks via the UI—no scripting required.

💡 Prefer scripting? The commands below install the Az modules on your local machine for testing; to import them into the Automation Account itself, use the Modules blade (or the New-AzAutomationModule cmdlet):

Install-Module -Name Az.Accounts -Force
Install-Module -Name Az.Compute -Force
  6. Assign the Virtual Machine Contributor role to the Automation Account's managed identity at the resource group or subscription level.
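If you prefer to make that role assignment in PowerShell, here's a minimal sketch. The object ID is a placeholder taken from the Automation Account's Identity blade, and the scope shown is the example resource group from this post:

# Object ID of the Automation Account's system-assigned managed identity (from the Identity blade)
$principalId = "<automation-account-identity-object-id>"

New-AzRoleAssignment -ObjectId $principalId `
    -RoleDefinitionName "Virtual Machine Contributor" `
    -Scope "/subscriptions/<subscription-id>/resourceGroups/MyResourceGroup"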

⚙️ CLI or PowerShell alternatives

# Azure CLI example to create the automation account
az automation account create \
  --name vm-scheduler \
  --resource-group MyResourceGroup \
  --location uksouth \
  --assign-identity

📅 Step 2: Add VM Tags for Scheduling

Apply consistent tags to any VM you want the runbooks to manage.

  • Key: AutoStartStop
  • Value: devserver

You can use the Azure Portal or PowerShell to apply these tags.

⚙️ Tag VMs via PowerShell

$vm = Get-AzVM -ResourceGroupName "MyRG" -Name "myVM"
$vm.Tags["AutoStartStop"] = "devserver"
Update-AzVM -VM $vm -ResourceGroupName "MyRG"

📂 Step 3: Create the Runbooks

🔗 Official docs: Create a runbook in Azure Automation

▶️ Create a New Runbook

  1. In your Automation Account, go to Process Automation > Runbooks.
  2. Click + Create a runbook.
  3. Name it something like Stop-TaggedVMs.
  4. Choose PowerShell as the type.
  5. Paste in the code below (repeat this process for the start runbook later).

🔹 Runbook Code: Auto-Stop Based on Tags

Param
(    
    [Parameter(Mandatory=$false)][ValidateNotNullOrEmpty()]
    [String]
    $AzureVMName = "All",

    [Parameter(Mandatory=$true)][ValidateNotNullOrEmpty()]
    [String]
    $AzureSubscriptionID = "<your-subscription-id>"
)

try {
    "Logging in to Azure..."
    # Authenticate using the Automation Account's managed identity.
    # With the system-assigned identity enabled in Step 1, no extra parameters are needed;
    # add -AccountId "<client-id>" only if you switch to a user-assigned identity.
    Connect-AzAccount -Identity
} catch {
    Write-Error -Message $_.Exception
    throw $_.Exception
}

$TagName  = "AutoStartStop"
$TagValue = "devserver"

Set-AzContext -Subscription $AzureSubscriptionID

if ($AzureVMName -ne "All") {
    $VMs = Get-AzResource -TagName $TagName -TagValue $TagValue | Where-Object {
        $_.ResourceType -like 'Microsoft.Compute/virtualMachines' -and $_.Name -like $AzureVMName
    }
} else {
    $VMs = Get-AzResource -TagName $TagName -TagValue $TagValue | Where-Object {
        $_.ResourceType -like 'Microsoft.Compute/virtualMachines'
    }
}

foreach ($VM in $VMs) {
    Stop-AzVM -ResourceGroupName $VM.ResourceGroupName -Name $VM.Name -Verbose -Force
}

🔗 Docs: Connect-AzAccount with Managed Identity

🔹 Create the Start Runbook

Duplicate the above, replacing Stop-AzVM with Start-AzVM.
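In other words, only the final loop changes (a minimal sketch):

foreach ($VM in $VMs) {
    # Same tag-filtered $VMs collection as the stop runbook
    Start-AzVM -ResourceGroupName $VM.ResourceGroupName -Name $VM.Name -Verbose
}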

🔗 Docs: Start-AzVM


🗓️ Step 4: Create Schedules and Link the Runbooks

🔗 Docs: Create schedules in Azure Automation

  1. Go to the Automation Account > Schedules > + Add a schedule.
  2. Create two schedules:
     • DailyStartWeekdays — Recurs every weekday at 07:30
     • DailyStopWeekdays — Recurs every weekday at 18:30
  3. Go to each runbook > Link to schedule > Choose the matching schedule.

📊 You can get creative here: separate schedules for dev vs UAT, or different times for different departments.


🧪 Testing Your Runbooks

You can test each runbook directly in the portal:

  • Open the runbook
  • Click Edit > Test Pane
  • Provide test parameters if needed
  • Click Start and monitor output

This is also a good time to validate:

  • The identity has permission
  • The tags are applied correctly
  • The VMs are in a stopped or running state as expected
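You can also trigger a test run from PowerShell rather than the Test Pane. A sketch using the account and runbook names from earlier in this post (the resource group is a placeholder):

# Kick off the stop runbook on demand
$job = Start-AzAutomationRunbook -AutomationAccountName "vm-scheduler" `
    -ResourceGroupName "MyResourceGroup" `
    -Name "Stop-TaggedVMs" `
    -Parameters @{ AzureSubscriptionID = "<your-subscription-id>" }

# Review the job's output streams once it has run
Get-AzAutomationJobOutput -AutomationAccountName "vm-scheduler" `
    -ResourceGroupName "MyResourceGroup" `
    -Id $job.JobId -Stream Any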

📊 The Results

Even this lightweight automation has produced major savings in our environment. Non-prod VMs are now automatically turned off outside office hours, resulting in monthly compute savings of up to 60% without sacrificing availability during working hours.


🧠 Ideas for Further Enhancement

  • Pull tag values from a central config (e.g. Key Vault or Storage Table)
  • Add logic to check for active RDP sessions or Azure Monitor heartbeats
  • Alert via email or Teams on job success/failure
  • Track savings over time and visualize them

💭 Final Thoughts

If you’re looking for a practical, immediate way to implement FinOps principles in Azure, VM scheduling is a great place to start. With minimal setup and maximum flexibility, custom runbooks give you control without the complexity of the canned solutions.

Have you built something similar or extended this idea further? I’d love to hear about it—drop me a comment or reach out on LinkedIn.

Stay tuned for more FinOps tips coming soon!


⌚Enforcing Time Zone and DST Compliance on Windows Servers Using GPO and Scheduled Tasks


🛠️ Why This Matters

Time zone misconfigurations — especially those affecting Daylight Saving Time (DST) — can cause:

  • Scheduled tasks to run early or late
  • Timestamp mismatches in logs
  • Errors in time-sensitive integrations

Windows doesn’t always honour DST automatically, particularly in Azure VMs, automated deployments, or custom images.


🔁 What’s Changed in 2025?

As of April 2025, we revised our approach to enforce time zone compliance in a cleaner, more manageable way:

  • 🧹 Removed all registry-based enforcement from existing GPOs
  • ⚙️ Executed a one-time PowerShell script to correct servers incorrectly set to UTC (excluding domain controllers)
  • ⏲️ Updated the GPO to use a Scheduled Task that sets the correct time zone at startup (GMT Standard Time)

📋 Audit Process: Time Zone and NTP Source Check

Before remediation, an audit was performed across the server estate to confirm the current time zone and time sync source for each host.

🔎 Time Zone Audit Script

# Set your target OU
$OU = "OU=Servers,DC=yourdomain,DC=local"

# Prompt for credentials once
$cred = Get-Credential

# Optional: output to file
$OutputCsv = "C:\Temp\TimeZoneAudit.csv"
$results = @()

# Get all enabled computer objects in the OU
$servers = Get-ADComputer -Filter {Enabled -eq $true} -SearchBase $OU -Properties Name | Select-Object -ExpandProperty Name

foreach ($server in $servers) {
    Write-Host "`nConnecting to $server..." -ForegroundColor Cyan
    try {
        $tzInfo = Invoke-Command -ComputerName $server -Credential $cred -ScriptBlock {
            $tz = Get-TimeZone
            $source = (w32tm /query /source) -join ''
            $status = (w32tm /query /status | Out-String).Trim()
            [PSCustomObject]@{
                ComputerName     = $env:COMPUTERNAME
                TimeZoneId       = $tz.Id
                TimeZoneDisplay  = $tz.DisplayName
                CurrentTime      = (Get-Date).ToString("yyyy-MM-dd HH:mm:ss")
                TimeSource       = $source
                SyncStatus       = $status
            }
        } -ErrorAction Stop

        $results += $tzInfo
    }
    catch {
        Write-Warning "Failed to connect to ${server}: $_"
        $results += [PSCustomObject]@{
            ComputerName     = $server
            TimeZoneId       = "ERROR"
            TimeZoneDisplay  = "ERROR"
            CurrentTime      = "N/A"
            TimeSource       = "N/A"
            SyncStatus       = "N/A"
        }
    }
}

# Output results
$results | Format-Table -AutoSize

# Save to CSV
$results | Export-Csv -NoTypeInformation -Path $OutputCsv
Write-Host "`nAudit complete. Results saved to $OutputCsv" -ForegroundColor Green

🧰 GPO-Based Scheduled Task (Preferred Solution)

Instead of relying on registry modifications, we now use a Scheduled Task deployed via Group Policy.

✅ Task Overview

  • Trigger: At Startup
  • Action: Run powershell.exe
  • Arguments:
-Command "Set-TimeZone -Id 'GMT Standard Time'"

💡 The GPO targets all domain-joined servers. Servers in isolated environments (e.g. DMZ) or not joined to the domain are excluded.
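For the odd server that can't receive the GPO (workgroup or isolated DMZ hosts), the same task can be registered locally with the ScheduledTasks module. A sketch, assuming an elevated session; the task name is arbitrary:

# Local equivalent of the GPO-deployed task: set the time zone as SYSTEM at startup
$action  = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument '-Command "Set-TimeZone -Id ''GMT Standard Time''"'
$trigger = New-ScheduledTaskTrigger -AtStartup

Register-ScheduledTask -TaskName "Set-UKTimeZone" `
    -Action $action -Trigger $trigger `
    -User "SYSTEM" -RunLevel Highest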


📸 Scheduled Task Screenshots

GPO Task Properties - General Tab
Fig 1: Scheduled Task created via GPO Preferences

Scheduled Task Action Details
Fig 2: PowerShell command configuring the time zone


🛠️ One-Off Remediation Script: Setting the Time Zone

For servers identified as incorrect in the audit, the following script was used to apply the fix:

# List of servers to correct (e.g., from your audit results)
$servers = @(
    "server1",
    "server2",
    "server3"
)

# Prompt for credentials if needed
$cred = Get-Credential

foreach ($server in $servers) {
    Write-Host "Setting time zone on $server..." -ForegroundColor Cyan
    try {
        Invoke-Command -ComputerName $server -Credential $cred -ScriptBlock {
            Set-TimeZone -Id "GMT Standard Time"
        } -ErrorAction Stop

        Write-Host "✔ $server: Time zone set to GMT Standard Time" -ForegroundColor Green
    }
    catch {
        Write-Warning "✖ Failed to set time zone on ${server}: $_"
    }
}

🔍 How to Verify Time Zone + DST Compliance

Use these PowerShell commands to verify:

Get-TimeZone
(Get-TimeZone).SupportsDaylightSavingTime

And for registry inspection (read-only):

Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Control\TimeZoneInformation" |
  Select-Object TimeZoneKeyName, DisableAutoDaylightTimeSet, DynamicDaylightTimeDisabled

Expected values:

  • TimeZoneKeyName: "GMT Standard Time"
  • DisableAutoDaylightTimeSet: 0
  • DynamicDaylightTimeDisabled: 0

🧼 Summary

To ensure consistent time zone configuration and DST compliance:

  • Use a GPO-based Scheduled Task to set GMT Standard Time at startup
  • Run a one-time audit and remediation script to fix legacy misconfigurations
  • Avoid registry edits — they’re no longer required
  • Validate using Get-TimeZone and confirm time sync via w32tm

📘 Next Steps

  • [ ] Extend to Azure Arc or Intune-managed servers
  • [ ] Monitor for changes in Windows DST behaviour in future builds
  • [ ] Automate reporting to maintain compliance across environments

🧠 Final Thoughts

This GPO+script approach delivers a clean, scalable way to enforce time zone standards and DST logic — without relying on brittle registry changes.

Let me know if you'd like help adapting this for cloud-native or hybrid environments!



⏲️ Configuring UK Regional Settings on Windows Servers with PowerShell

When building out cloud-hosted or automated deployments of Windows Servers, especially for UK-based organisations, it’s easy to overlook regional settings. But these seemingly small configurations — like date/time formats, currency symbols, or keyboard layouts — can have a big impact on usability, application compatibility, and user experience.

In this post, I’ll show how I automate this using a simple PowerShell script that sets all relevant UK regional settings in one go.


🔍 Why Regional Settings Matter

Out-of-the-box, Windows often defaults to en-US settings:

  • Date format becomes MM/DD/YYYY
  • Decimal separators switch to . instead of ,
  • Currency symbols use $
  • Time zones default to US-based settings
  • Keyboard layout defaults to US (which can be infuriating!)

For UK-based organisations, this can:

  • Cause confusion in logs or spreadsheets
  • Break date parsing in scripts or apps expecting DD/MM/YYYY
  • Result in the wrong characters being typed (e.g., @ vs ")
  • Require manual fixing after deployment

Automating this ensures consistency across environments, saves time, and avoids annoying regional mismatches.


🔧 Script Overview

I created a PowerShell script that:

  • Sets the system locale and input methods
  • Configures UK date/time formats
  • Applies the British English language pack (if needed)
  • Sets the time zone to GMT Standard Time (London)

The script can be run manually, included in provisioning pipelines, or dropped into automation tools like Task Scheduler or cloud-init processes.


✅ Prerequisites

To run this script, you should have:

  • Administrator privileges
  • PowerShell 5.1+ (default on most supported Windows Server versions)
  • Optional: Internet access (if language pack needs to be added)

🔹 The Script: Set-UKRegionalSettings.ps1

# Set system locale and formats to English (United Kingdom)
Set-WinSystemLocale -SystemLocale en-GB
Set-WinUserLanguageList -LanguageList en-GB -Force
Set-Culture en-GB
Set-WinHomeLocation -GeoId 242
Set-TimeZone -Id "GMT Standard Time"

# Optional reboot prompt
Write-Host "UK regional settings applied. A reboot is recommended for all changes to take effect."

🚀 How to Use It

✈️ Option 1: Manual Execution

  1. Open PowerShell as Administrator
  2. Run the script:
.\Set-UKRegionalSettings.ps1

🔢 Option 2: Include in Build Pipeline or Image

For Azure VMs or cloud images, consider running this as part of your deployment process via:

  • Custom Script Extension in ARM/Bicep
  • cloud-init or Terraform provisioners
  • Group Policy Startup Script

⚡ Quick Tips

  • Reboot after running to ensure all settings apply across UI and system processes.
  • For non-UK keyboards (like US physical hardware), you may also want to explicitly set InputLocale (see the snippet after this list).
  • Want to validate the settings? Use:
Get-WinSystemLocale
Get-Culture
Get-WinUserLanguageList
Get-TimeZone
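As mentioned in the tips above, the default keyboard layout can be pinned explicitly. A small sketch using the International module; "0809:00000809" is the English (United Kingdom) language/keyboard pair:

# Force the UK keyboard layout as the default input method
Set-WinDefaultInputMethodOverride -InputTip "0809:00000809"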

📂 Registry Verification: Per-User and Default Settings

Registry Editor Screenshot

If you're troubleshooting or validating the configuration for specific users, regional settings are stored in the Windows Registry under:

👤 For Each User Profile

HKEY_USERS\<SID>\Control Panel\International

You can find the user SIDs by looking under HKEY_USERS or using:

Get-ChildItem Registry::HKEY_USERS

🧵 For New Users (Default Profile)

HKEY_USERS\.DEFAULT\Control Panel\International

This determines what settings new user profiles inherit on first logon.

You can script changes here if needed, but always test carefully to avoid corrupting profile defaults.
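A minimal read-only check of what new profiles will inherit (the value names shown are the standard ones under Control Panel\International):

# Inspect the default-profile regional values without modifying anything
Get-ItemProperty "Registry::HKEY_USERS\.DEFAULT\Control Panel\International" |
    Select-Object LocaleName, sShortDate, sLongDate, sTimeFormat, sCurrency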


🌟 Final Thoughts

Small tweaks like regional settings might seem minor, but they go a long way in making your Windows Server environments feel localised and ready for your users.

Automating them early in your build pipeline means one less thing to worry about during post-deployment configuration.

Let me know if you want a version of this that handles multi-user scenarios or works across multiple OS versions!


🕵️ Replacing SAS Tokens with User Assigned Managed Identity (UAMI) in AzCopy for Blob Uploads

Using Shared Access Signature (SAS) tokens with azcopy is common — but rotating tokens and handling them securely can be a hassle. To improve security and simplify our automation, I recently replaced SAS-based authentication in our scheduled AzCopy jobs with Azure User Assigned Managed Identity (UAMI).

In this post, I’ll walk through how to:

  • Replace AzCopy SAS tokens with managed identity authentication
  • Assign the right roles to the UAMI
  • Use azcopy login to authenticate non-interactively
  • Automate the whole process in PowerShell

🔍 Why Remove SAS Tokens?

SAS tokens are useful, but:

  • 🔑 They’re still secrets — and secrets can be leaked
  • 📅 They expire — which breaks automation when not rotated
  • 🔐 They grant broad access — unless scoped very carefully

Managed Identity is a much better approach when the copy job is running from within Azure (like an Azure VM or Automation account).


🌟 Project Goal

Replace the use of SAS tokens in an AzCopy job that uploads files from a local UNC share to Azure Blob Storage — by using a User Assigned Managed Identity.


✅ Prerequisites

To follow along, you’ll need:

  • A User Assigned Managed Identity (UAMI)
  • A Windows Server or Azure VM to run the copy job
  • Access to a local source folder or UNC share (e.g., \\fileserver\data\export\)
  • AzCopy v10.7+ installed on the machine
  • Azure RBAC permissions to assign roles

ℹ️ Check AzCopy Version: Run azcopy --version to ensure you're using v10.7.0 or later, which is required for --identity-client-id support.


🔧 Step-by-Step Setup

🛠️ Step 1: Create the UAMI

✅ CLI
az identity create \
  --name my-azcopy-uami \
  --resource-group my-resource-group \
  --location <region>
✅ Portal
  1. Go to Managed Identities in the Azure Portal
  2. Click + Create and follow the wizard
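
If you prefer Az PowerShell over the CLI, a roughly equivalent sketch (using the Az.ManagedServiceIdentity module and the same placeholder names as above) would be:

# Create the user assigned managed identity with Az PowerShell
New-AzUserAssignedIdentity `
  -ResourceGroupName "my-resource-group" `
  -Name "my-azcopy-uami" `
  -Location "<region>"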

🖇️ Step 2: Assign the UAMI to the Azure VM

AzCopy running on a VM must be able to assume the identity. Assign the UAMI to your VM:

✅ CLI
az vm identity assign \
  --name my-vm-name \
  --resource-group my-resource-group \
  --identities my-azcopy-uami
✅ Portal
  1. Navigate to the Virtual Machines blade
  2. Select the VM running your AzCopy script
  3. Under Settings, click Identity
  4. Go to the User assigned tab
  5. Click + Add, select your UAMI, then click Add
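
The Az PowerShell equivalent is roughly as follows; note that Update-AzVM wants the identity's full resource ID, and the resource group and names below are the same placeholders as above:

# Look up the UAMI and attach it to the VM as a user assigned identity
$uami = Get-AzUserAssignedIdentity -ResourceGroupName "my-resource-group" -Name "my-azcopy-uami"
$vm   = Get-AzVM -ResourceGroupName "my-resource-group" -Name "my-vm-name"

Update-AzVM -ResourceGroupName "my-resource-group" -VM $vm `
  -IdentityType UserAssigned -IdentityId $uami.Id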

🔐 Step 3: Assign RBAC Permissions to UAMI

For AzCopy to function correctly with a UAMI, the following role assignments are recommended:

  • Storage Blob Data Contributor: Required for read/write blob operations
  • Storage Blob Data Reader: (Optional) For read-only scenarios or validation scripts
  • Reader: (Optional) For browsing or metadata-only permissions on the storage account

RBAC Tip: It may take up to 5 minutes for role assignments to propagate fully. If access fails initially, wait and retry.

✅ CLI
az role assignment create \
  --assignee <client-id-or-object-id> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>/blobServices/default/containers/<container-name>"

az role assignment create \
  --assignee <client-id-or-object-id> \
  --role "Storage Blob Data Reader" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>"

az role assignment create \
  --assignee <client-id-or-object-id> \
  --role "Reader" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
✅ Portal
  1. Go to your Storage Account in the Azure Portal
  2. Click on the relevant container (or stay at the account level for broader scope)
  3. Open Access Control (IAM)
  4. Click + Add role assignment, then repeat for each role:
  5. Select Storage Blob Data Contributor, assign it to your UAMI, and click Save
  6. Select Storage Blob Data Reader, assign it to your UAMI, and click Save
  7. Select Reader, assign it to your UAMI, and click Save
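
Or, as an Az PowerShell sketch: assign against the UAMI's PrincipalId (not its ClientId), and treat the scope string as a placeholder for your own subscription:

# Grant the UAMI blob write access at container scope
$uami  = Get-AzUserAssignedIdentity -ResourceGroupName "my-resource-group" -Name "my-azcopy-uami"
$scope = "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>/blobServices/default/containers/<container-name>"

New-AzRoleAssignment -ObjectId $uami.PrincipalId `
  -RoleDefinitionName "Storage Blob Data Contributor" `
  -Scope $scope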

🧪 Step 4: Test AzCopy Login Using UAMI

$clientId = "<your-uami-client-id>"
& "C:\azcopy\azcopy.exe" login --identity --identity-client-id $clientId

You should see a confirmation message that AzCopy has successfully logged in.

🔍 To see how AzCopy is sourcing credentials, you can run:

azcopy env

This lists the environment variables AzCopy recognises (for example AZCOPY_AUTO_LOGIN_TYPE and AZCOPY_MSI_CLIENT_ID) along with their current values, which is handy if you switch to the environment-variable auto-login approach sketched below.
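
As an alternative to running azcopy login at all, AzCopy can auto-login on each invocation via environment variables. A quick sketch (AZCOPY_AUTO_LOGIN_TYPE and AZCOPY_MSI_CLIENT_ID are AzCopy's documented settings, but confirm support in your installed version):

# Tell AzCopy to authenticate with the managed identity on every run
$env:AZCOPY_AUTO_LOGIN_TYPE = "MSI"
$env:AZCOPY_MSI_CLIENT_ID   = "<your-uami-client-id>"

# Subsequent AzCopy commands no longer need an explicit 'azcopy login'
& "C:\azcopy\azcopy.exe" list "https://<your-storage-account>.blob.core.windows.net/<container-name>"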


📁 Step 5: Upload Files Using AzCopy + UAMI

Here's the PowerShell script that copies all files from a local share to the Blob container:

$clientId = "<your-uami-client-id>"

# Login with Managed Identity
& "C:\azcopy\azcopy.exe" login --identity --identity-client-id $clientId

# Run the copy job (the backtick is PowerShell's line-continuation character)
& "C:\azcopy\azcopy.exe" copy `
  "\\fileserver\data\export" `
  "https://<your-storage-account>.blob.core.windows.net/<container-name>" `
  --overwrite=true `
  --from-to=LocalBlob `
  --blob-type=Detect `
  --put-md5 `
  --recursive `
  --log-level=INFO

💡 UNC Note: PowerShell doesn't require escaping backslashes, so write the UNC path exactly as it appears in Explorer (\\fileserver\data\export). The trailing backticks above are line-continuation characters, not part of the path.

This script can be scheduled using Task Scheduler or run on demand.


⏱️ Automate with Task Scheduler (Optional)

To automate the job:

  1. Open Task Scheduler on your VM
  2. Create a New Task (not a Basic Task)
  3. Under General, select "Run whether user is logged on or not"
  4. Under Actions, add a new action to run powershell.exe
  5. Set the arguments to point to your .ps1 script
  6. Ensure the AzCopy path is hardcoded in your script
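
If you'd rather script the task than click through the UI, here's a rough sketch using the ScheduledTasks cmdlets; the task name, script path, and 02:00 trigger are placeholders for your own setup:

# Register a daily scheduled task that runs the AzCopy upload script as SYSTEM
$action    = New-ScheduledTaskAction -Execute 'powershell.exe' `
               -Argument '-NoProfile -ExecutionPolicy Bypass -File "C:\scripts\Copy-ExportToBlob.ps1"'
$trigger   = New-ScheduledTaskTrigger -Daily -At "02:00"
$principal = New-ScheduledTaskPrincipal -UserId 'SYSTEM' -LogonType ServiceAccount -RunLevel Highest

Register-ScheduledTask -TaskName 'AzCopy-UAMI-Upload' -Action $action -Trigger $trigger -Principal $principal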

🚑 Troubleshooting Common Errors

❌ 403 AuthorizationPermissionMismatch
  • Usually means the identity doesn't have the correct role, or the role assignment hasn't propagated yet
  • Double-check that the UAMI is assigned to the VM and that it has Storage Blob Data Contributor on the correct container
  • Wait 2–5 minutes and try again
❌ azcopy : The term 'azcopy' is not recognized
  • AzCopy is not in the system PATH
  • Solution: Use the full path to azcopy.exe, like C:\azcopy\azcopy.exe

🛡️ Benefits of Switching to UAMI

  • ✅ No secrets or keys stored on disk
  • ✅ No manual token expiry issues
  • ✅ Access controlled via Azure RBAC
  • ✅ Easily scoped and auditable

🧼 Final Thoughts

Replacing AzCopy SAS tokens with UAMI is one of those small wins that pays dividends over time. Once set up, it's secure, robust, and hands-off.

Let me know if you'd like a variant of this that works from Azure Automation or a hybrid worker!

