Karpenter for Azure - Hands on with AKS Node Auto Provisioning

Traditional cluster autoscaling in Kubernetes has always felt a bit reactionary. For years, platform engineers have lived in a world dictated by rigid, predefined node pools. You deploy an application, wait for a pod to fail to schedule due to insufficient resources, and then watch the Cluster Autoscaler sluggishly trigger a virtual machine scale set expansion. Minutes later, generic compute arrives, which may or may not be optimally sized for what your application actually required.

With the introduction of Node Auto Provisioning (NAP) in Azure Kubernetes Service (AKS), powered by an implementation of managed Karpenter, that entire paradigm shifts. We are moving away from an infrastructure-first mentality to a workload-first reality. Instead of trying to fit square pegs into round holes with static node pools, the cluster dynamically provisions the exact virtual machines your workloads need, precisely when they need them, and terminates them the moment demand drops.

Shifting the Architecture to Workload-First

When you enable Node Auto Provisioning on an AKS cluster, you stop thinking about Virtual Machine Scale Sets under the hood. Instead, the cluster introduces native Kubernetes Custom Resource Definitions to manage your infrastructure boundaries directly from the control plane using two core resources: the AKSNodeClass and the NodePool.

The AKSNodeClass acts as your direct link to Azure’s fabric. It allows you to define provider-specific settings, such as the OS disk size, disk types, and assigned identities. Crucially, it isolates the operational infrastructure from the scheduling logic.

On top of that sits the NodePool. Unlike legacy node pools which represent a literal group of identical machines, a Karpenter-managed NodePool represents a set of declarative boundaries. It outlines the constraints, disruption budgets, and consolidation policies that govern how compute should be spun up. When a pod enters a pending state, NAP evaluates these boundaries and talks directly to the Azure API to provision a tailored machine, cutting out the middleman of traditional autoscaling groups.

Precision Sizing and SKU Management

One of the immediate benefits of transitioning to this model is the elimination of node pool sprawl. In a standard architecture, if a team needs memory-optimized nodes, a new node pool is manually provisioned. If another team needs compute-optimized instances or specific architecture types, yet another node pool is added. This quickly becomes a management nightmare for platform teams.

Node Auto Provisioning handles this natively through fluid constraints. Within a single configuration, you can establish granular SKU rules that align perfectly with your enterprise requirements or Azure subscription quotas. For instance, you can restrict your cluster to evaluate only specific compute families, like the latest generation D-series or E-series virtual machines.

When a deployment requests resources, NAP looks at the exact CPU and memory requests of that workload, filters through your allowed SKU list, and chooses the most cost-effective virtual machine size that satisfies the requirement. If a pod requests a massive block of CPU, it gets a larger node. If you scale down, Karpenter automatically handles the bin-packing, consolidating workloads onto smaller, cheaper VMs without administrative intervention.

High Availability and Zone Resilience

Operating platform infrastructure at scale means building for failure. In Azure, that translates to distributing your application footprints across multiple physical Availability Zones. Traditionally, achieving true zonal resilience required managing identical node pools replicated across zones 1, 2, and 3, which often led to severe underutilization and wasted spend.

With managed Karpenter on AKS, zone awareness becomes an inherent part of the scheduling cycle. By integrating topology constraints directly into your application deployments, you can instruct the cluster to maintain strict high-availability boundaries.

When you scale an application, NAP dynamically calculates the current distribution of your pods across the infrastructure. If a topology constraint dictates that the next pod must land in a specific zone to maintain balance, NAP will explicitly spin up a new virtual machine inside that exact Azure Availability Zone. You no longer have to pre-provision idle standby nodes in every zone just in case a scale event occurs; the infrastructure matches the architectural topology rules on demand.

Blending On-Demand and Spot for FinOps Success

Optimizing cloud economics is a core pillar of modern platform engineering, but it often comes with a trade-off in stability. Running workloads entirely on Azure Spot virtual machines can dramatically slash your compute bill, but the inherent risk of eviction means you cannot run core services there safely.

Node Auto Provisioning elegantly solves this problem by allowing platform teams to define multiple, weighted node configurations within the same cluster. This enables a sophisticated capacity blending strategy where you can seamlessly mix On-Demand and Spot instances based on the nature of the workload.

By assigning weights to different node definitions, you can instruct the cluster to aggressively favor lower-cost Spot instances for transient, fault-tolerant applications or development environments. If Spot capacity becomes unavailable in a specific region or if Azure issues an eviction notice, Karpenter automatically detects the disruption and falls back to provisioning standard On-Demand instances. This ensures your applications remain highly available while maintaining the absolute lowest possible price point.

Conclusion

Transitioning your internal engineering platform to AKS Node Auto Provisioning is more than just an upgrade to your autoscaling speed; it is a fundamental modernization of how cloud-native compute is managed. By treating infrastructure as a fluid, just-in-time resource managed natively via Kubernetes CRDs, you completely eliminate the operational overhead of maintaining static node pools.

The combination of granular SKU management, automated zonal awareness, and intelligent Spot blending means your cluster is consistently optimized for both performance and cost. As you build out automated developer platforms, embedding managed Karpenter into your core architecture ensures that your underlying infrastructure seamlessly scales alongside application velocity, providing a robust, hands-off foundation for your engineering teams.

You can find all the code sample on GitHub.

This post was AI-generated based on the transcript of the video “Karpenter for Azure - Hands on with AKS Node Auto Provisioning”.

Karpenter for Azure - Hands on with AKS Node Auto Provisioning

Shifting the Architecture to Workload-First

Precision Sizing and SKU Management

High Availability and Zone Resilience

Blending On-Demand and Spot for FinOps Success

Conclusion

Video - Karpenter for Azure - Hands on with AKS Node Auto Provisioning

Further Reading

Video - Karpenter for Azure - Hands on with AKS Node Auto Provisioning

Microservice Series - From Zero to Hero

AKS ACNS Walkthrough - L7 Firewalls, mTLS, and eBPF Acceleration