Thursday, May 8, 2025

Setting Up Azure Monitor Alerts for AI Workload Anomalies

As organizations adopt AI workloads on Azure, the need for targeted monitoring becomes critical. Unlike traditional applications, AI services such as Azure OpenAI and Azure Machine Learning can generate significant cost spikes from a single misconfigured request or an idle compute cluster left running overnight.

Azure Monitor provides the tooling to detect and respond to these anomalies before they appear on the monthly invoice. This post walks through configuring alerts specifically for Azure OpenAI and Azure Machine Learning workloads.

1. Metric Alerts for Azure OpenAI Service

Azure OpenAI exposes several platform metrics that are useful for anomaly detection. The most relevant for cost monitoring are Processed Prompt Tokens and Processed Completion Tokens.

To create a metric alert:

  1. Navigate to your Azure OpenAI resource > Monitoring > Alerts > + Create > Alert rule
  2. Under Condition, select Add condition and search for Processed Completion Tokens
  3. Set Threshold type to Dynamic to allow Azure to learn the baseline from historical traffic
  4. Set Aggregation to Total over a 5-minute evaluation window
  5. Configure Alert sensitivity to Medium as a starting point

Dynamic thresholds are preferable for AI workloads because token consumption varies naturally with legitimate traffic. Static thresholds tend to generate excessive false positives during expected peak periods.
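For repeatable deployments, the same rule can be scripted with the Azure CLI. The following is a sketch of the portal steps above: the resource group, resource name, and alert name are placeholders you would replace with your own, and `dynamic medium 4 of 4` corresponds to Medium sensitivity evaluated over four consecutive windows.

```shell
#!/usr/bin/env bash
# Sketch: dynamic-threshold metric alert on Azure OpenAI token throughput.
# Resource and alert names below are placeholders.
RG="rg-ai-monitoring"

# Look up the resource ID of the Azure OpenAI (Cognitive Services) account.
AOAI_ID=$(az cognitiveservices account show \
  --name my-openai-resource \
  --resource-group "$RG" \
  --query id -o tsv)

az monitor metrics alert create \
  --name "aoai-token-anomaly" \
  --resource-group "$RG" \
  --scopes "$AOAI_ID" \
  --condition "total 'Processed Completion Tokens' > dynamic medium 4 of 4" \
  --window-size 5m \
  --evaluation-frequency 5m \
  --severity 2 \
  --description "Dynamic-threshold alert on completion token throughput"
```

The `--condition` string encodes the same choices as the portal walkthrough: Total aggregation, a 5-minute window, and a dynamic threshold at Medium sensitivity.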

2. Log Alerts for Azure Machine Learning Compute

Metric alerts cover throughput anomalies, but log-based alerts can detect issues such as a compute cluster that failed to scale down after a training job completed.

The following KQL query flags compute clusters whose most recent state change was into the "Steady" state more than two hours ago, meaning no job has triggered a resize since:

AmlComputeClusterEvent
| where TimeGenerated > ago(1d)
| where EventType == "ClusterStateChanged"
| summarize arg_max(TimeGenerated, NewState) by ClusterName
| where NewState == "Steady"
| where TimeGenerated < ago(2h)

Navigate to Log Analytics workspace > Logs, validate the query, then select + New alert rule from the query toolbar to convert it into a scheduled log alert.
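The conversion can also be scripted with `az monitor scheduled-query create`. This is a sketch, not a definitive recipe: the `<sub>`, `<rg>`, and `<workspace>` segments, the rule name, and the action group ID are placeholders to fill in with your own values.

```shell
#!/usr/bin/env bash
# Sketch: turn the validated KQL query into a scheduled log alert.
# All angle-bracketed values and resource names are placeholders.
WORKSPACE_ID="/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>"
ACTION_GROUP_ID="/subscriptions/<sub>/resourceGroups/<rg>/providers/microsoft.insights/actionGroups/<group>"

az monitor scheduled-query create \
  --name "aml-idle-cluster" \
  --resource-group "<rg>" \
  --scopes "$WORKSPACE_ID" \
  --condition "count 'IdleClusters' > 0" \
  --condition-query IdleClusters="AmlComputeClusterEvent | where TimeGenerated > ago(1d) | where EventType == 'ClusterStateChanged' | summarize LastEvent = max(TimeGenerated) by ClusterName | where LastEvent < ago(2h)" \
  --evaluation-frequency 30m \
  --window-size 1d \
  --severity 2 \
  --action-groups "$ACTION_GROUP_ID"
```

The rule fires whenever the query returns at least one row, i.e. at least one cluster appears idle, and re-evaluates every 30 minutes.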

3. Configuring Action Groups

Action Groups define who gets notified when an alert fires. A well-configured action group ensures the right person can respond promptly.

Navigate to Azure Monitor > Alerts > Action groups > + Create and configure the following notification types:

  • Email/SMS — for the owning engineering team
  • Azure Function — for automated remediation such as scaling down an idle cluster
  • Webhook — for integration with Microsoft Teams channels or third-party incident tools

Once created, assign the action group to both the metric alert and the log alert rule configured in the previous steps.
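The action group itself can be created from the CLI as well. A minimal sketch covering the email and webhook receivers from the list above; the group name, recipient address, and webhook URL are placeholders (the Azure Function receiver is simplest to add afterwards in the portal):

```shell
#!/usr/bin/env bash
# Sketch: action group with email and webhook receivers.
# Names, the recipient address, and the webhook URL are placeholders.
az monitor action-group create \
  --name "ag-ai-cost" \
  --resource-group "rg-ai-monitoring" \
  --short-name "aicost" \
  --action email oncall-engineer oncall@example.com \
  --action webhook teams-channel "https://example.webhook.office.com/..."
```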

4. Using Alert Processing Rules to Reduce Noise

As the number of alert rules grows, Alert Processing Rules help manage notification fatigue. These rules can suppress alerts during scheduled maintenance windows or route different alert severities to different action groups.

Navigate to Azure Monitor > Alerts > Alert processing rules > + Create to define suppression schedules and routing logic based on resource tags or subscription scope.
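A maintenance-window suppression can likewise be scripted. The following is a sketch using `az monitor alert-processing-rule create`; the subscription scope, window datetimes, and rule name are placeholders, and `RemoveAllActionGroups` is the rule type that mutes notifications without disabling the underlying alert rules.

```shell
#!/usr/bin/env bash
# Sketch: suppress all alert notifications in a subscription during a
# maintenance window. Scope, names, and datetimes are placeholders.
az monitor alert-processing-rule create \
  --name "suppress-maintenance" \
  --resource-group "rg-ai-monitoring" \
  --rule-type RemoveAllActionGroups \
  --scopes "/subscriptions/<subscription-id>" \
  --schedule-start-datetime "2025-05-10 22:00:00" \
  --schedule-end-datetime "2025-05-11 02:00:00" \
  --schedule-time-zone "UTC" \
  --description "Mute alerts during scheduled maintenance"
```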

Summary

Standard monitoring configurations are not sufficient for AI workloads. Configuring dynamic metric alerts for Azure OpenAI token consumption and log-based alerts for Azure ML compute idle time ensures that anomalies are caught early, before they translate into an unexpected billing outcome.