Tuesday, February 17, 2026

Scaling Azure Container Apps with KEDA

Azure Container Apps uses KEDA (Kubernetes Event-driven Autoscaling) as its scaling engine. Unlike traditional CPU and memory-based autoscaling, KEDA can scale workloads in response to external event sources — queue depth, HTTP request rate, or custom metrics from virtually any system. This makes it particularly well-suited for event-driven and background processing architectures where resource demand is directly tied to the number of outstanding work items.

This post covers how Container Apps scaling works, and how to configure HTTP, Azure Queue Storage, and Azure Service Bus scaling rules for common workload patterns.

1. How Scaling Works in Container Apps

Container Apps scaling operates on two dimensions:

  • Replica count: the number of running container instances, from zero to a configured maximum
  • Scale trigger: the condition under which replicas are added or removed

Scaling to zero replicas is available on the Consumption workload profile when minReplicas is set to 0. When idle, the app consumes no compute and incurs no charge. The first incoming request after a scale-to-zero event will experience a cold start — typically 2–10 seconds depending on image size and initialisation logic.

Each scaling rule defines a target metric and a threshold per replica. Container Apps evaluates the active rules every 30 seconds and adjusts the replica count to maintain the target. If multiple scale rules are active, the rule requesting the most replicas takes precedence.

Scale rules are defined per container app, in the Azure portal, in a Bicep template, or with the Azure CLI.
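As a sketch of the CLI route, an HTTP scale rule with replica bounds can be applied using `az containerapp update`; the app name and resource group below are placeholders:

```shell
# Set replica bounds and an HTTP scale rule from the Azure CLI
# (app and resource group names are placeholders)
az containerapp update \
  --name my-app \
  --resource-group my-rg \
  --min-replicas 1 \
  --max-replicas 10 \
  --scale-rule-name http-scaling \
  --scale-rule-type http \
  --scale-rule-http-concurrency 50
```

This is equivalent to the Bicep approach shown in the following sections; pick one and keep it consistent so deployments don't overwrite manually applied rules.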

2. HTTP Scaling for Web-Facing Apps

HTTP scaling is the default choice for apps serving synchronous web requests. Container Apps counts concurrent HTTP requests and adds replicas when the count per replica exceeds a configured threshold.

Following is a Bicep definition for an HTTP scaling rule targeting 50 concurrent requests per replica, with a maximum of 10 replicas:

scale: {
  minReplicas: 1
  maxReplicas: 10
  rules: [
    {
      name: 'http-scaling'
      http: {
        metadata: {
          concurrentRequests: '50'
        }
      }
    }
  ]
}

concurrentRequests: '50' means a new replica is added approximately every 50 concurrent requests. Tune this value based on your app's throughput characteristics — a lower value scales more aggressively, a higher value tolerates more load per instance.

For HTTP workloads, set minReplicas to at least 1. A web-facing app that scales to zero will expose users to a cold start on the first request after idle, which is rarely acceptable for interactive workloads.

3. Azure Queue Storage Scaling for Background Workers

Workers that consume messages from Azure Queue Storage should scale in proportion to the number of waiting messages. The Azure Queue KEDA scaler reads the queue depth and scales replicas to maintain a target messages-per-replica ratio.

Following is a Bicep definition for a queue-based scaling rule targeting 5 messages per replica:

scale: {
  minReplicas: 0
  maxReplicas: 20
  rules: [
    {
      name: 'queue-scaling'
      custom: {
        type: 'azure-queue'
        metadata: {
          queueName: 'work-items'
          queueLength: '5'
          accountName: '<storage-account-name>'
        }
        auth: [
          {
            secretRef: 'storage-connection-string'
            triggerParameter: 'connection'
          }
        ]
      }
    }
  ]
}

With queueLength: '5', Container Apps scales to 10 replicas when 50 messages are queued. When the queue is empty, it scales to 0 replicas after the cool-down period (300 seconds by default).

Store the storage account connection string in a Container Apps secret and reference it via secretRef, rather than embedding it in the scaling rule metadata.
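As a sketch, the same secret-plus-rule setup can be done from the CLI; the app, resource group, and storage account names below are placeholders, and the `--scale-rule-auth` mapping mirrors the `auth` block in the Bicep above:

```shell
# Store the connection string as a Container Apps secret,
# then reference it from the queue scale rule (names are placeholders)
az containerapp secret set \
  --name worker-app \
  --resource-group my-rg \
  --secrets storage-connection-string='<connection-string>'

az containerapp update \
  --name worker-app \
  --resource-group my-rg \
  --min-replicas 0 \
  --max-replicas 20 \
  --scale-rule-name queue-scaling \
  --scale-rule-type azure-queue \
  --scale-rule-metadata queueName=work-items queueLength=5 accountName='<storage-account-name>' \
  --scale-rule-auth connection=storage-connection-string
```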

4. Azure Service Bus Scaling for Event-Driven Workloads

The Azure Service Bus KEDA scaler works similarly to the queue storage scaler but targets the active message count in a Service Bus queue or topic subscription. It is the preferred approach for Service Bus consumers because it scales based on the number of messages actually available for processing.

Following is a Bicep definition for a Service Bus scaling rule:

scale: {
  minReplicas: 0
  maxReplicas: 15
  rules: [
    {
      name: 'servicebus-scaling'
      custom: {
        type: 'azure-servicebus'
        metadata: {
          queueName: 'orders'
          messageCount: '10'
          namespace: '<servicebus-namespace>'
        }
        auth: [
          {
            secretRef: 'servicebus-connection-string'
            triggerParameter: 'connection'
          }
        ]
      }
    }
  ]
}

I recommend keeping the Service Bus connection string in Azure Key Vault and referencing it via a Container Apps secret backed by a Key Vault reference, rather than storing the value directly in the app configuration.
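As a sketch of that setup, a Key Vault-backed secret can be created from the CLI using the `keyvaultref:` syntax; the app name, secret URI, and managed identity resource ID below are placeholders:

```shell
# Create a Container Apps secret backed by a Key Vault reference
# (placeholders throughout); the app's managed identity needs
# read access to the referenced Key Vault secret
az containerapp secret set \
  --name orders-app \
  --resource-group my-rg \
  --secrets "servicebus-connection-string=keyvaultref:<key-vault-secret-uri>,identityref:<managed-identity-resource-id>"
```

With this in place, rotating the secret in Key Vault propagates to the app without redeploying it.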

5. Setting Minimum Replicas and Scale-to-Zero Behaviour

The minReplicas value is the most consequential scaling decision for each workload. Setting it to 0 eliminates idle compute cost but introduces cold start latency on the first request after the app scales down. Setting it to 1 or higher ensures the app is always ready, at the cost of continuous baseline compute.

Following is a practical guide for choosing minReplicas by workload type:

Workload type                    Recommended minReplicas    Reasoning
Queue or Service Bus consumer    0                          Latency on first message is acceptable
Internal HTTP API                1                          Eliminates cold start for internal callers
Customer-facing HTTP API         2                          Eliminates cold start and provides basic redundancy
Scheduled or batch job           0                          Triggered on a schedule; cold start is acceptable

To configure minimum and maximum replicas in the portal:

  1. Navigate to Container Apps > [app] > Scale and replicas
  2. Set Min replicas and Max replicas
  3. Select + Add under Scale rules to configure a new rule
  4. Select Save

Monitor the Replica Count metric under Container Apps > [app] > Metrics to observe scale-out and scale-in events over time, and adjust the maxReplicas ceiling if the app consistently reaches it under normal operating conditions.
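The same metric can be pulled from the CLI for scripted monitoring; the resource ID below is a placeholder, and the metric name `Replicas` is an assumption about how the portal's Replica Count metric is exposed programmatically:

```shell
# Query replica count at one-minute granularity (resource ID is a placeholder)
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.App/containerApps/my-app" \
  --metric Replicas \
  --interval PT1M
```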

Summary

KEDA gives Container Apps a flexible, event-driven scaling model that goes well beyond CPU and memory thresholds. HTTP scaling ensures web workloads respond proportionally to request volume; Queue Storage and Service Bus scaling ensure background workers process backlogs efficiently without idle compute waste. Matching minReplicas to the latency tolerance of each workload type — zero for batch processors, one or more for synchronous APIs — is the configuration decision that most directly determines both cost and responsiveness.
