Create metric alerting rules

Define alerting conditions based on Prometheus metrics using PrometheusRule CRDs. Alerts are automatically picked up by the Prometheus Operator and routed to Alertmanager.

  • UDS CLI installed
  • UDS Registry account created and authenticated locally with a read token
  • Access to a Kubernetes cluster with UDS Core deployed
  • Familiarity with PromQL

UDS Core ships default alerting rules from two sources. The upstream kube-prometheus-stack chart provides cluster and node health alerts, and UDS Core provides default probe alerts for endpoint downtime and TLS certificate expiry.

Runbooks for upstream defaults are available at runbooks.prometheus-operator.dev. This guide covers creating custom rules for your applications and optionally tuning either default set.

  1. Create a PrometheusRule

    Define a PrometheusRule custom resource containing one or more alert rules. The Prometheus Operator watches for these CRs and loads them into Prometheus automatically.

    my-app-alerts.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: my-app-alerts
      namespace: my-app
    spec:
      groups:
        - name: my-app
          rules:
            - alert: PodRestartingFrequently
              expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "Pod {{ $labels.pod }} is restarting frequently"
                runbook: "https://example.com/runbooks/pod-restart"
                description: "Pod restarted {{ $value }} times in the last hour"
            - alert: HighMemoryUsage
              expr: |
                (container_memory_working_set_bytes / on(namespace, pod, container) kube_pod_container_resource_limits{resource="memory"}) * 100 > 80
              for: 15m
              labels:
                severity: warning
              annotations:
                summary: "High memory usage detected"
                runbook: "https://example.com/runbooks/high-memory-usage"
                description: "Container using {{ $value }}% of memory limit"

    Key fields in each rule:

    • expr: PromQL expression that defines the alerting condition. When this expression returns results, the alert becomes active.
    • for: How long the condition must be continuously true before the alert fires. Prevents flapping on transient spikes.
    • labels.severity: Used by Alertmanager for routing. Common values are critical, warning, and info.
    • annotations: Human-readable context attached to the alert. Include a summary, description, and runbook URL to make alerts actionable.
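    Beyond threshold rules, a common pattern is alerting when a metric disappears entirely, for example when every scrape target for a job stops reporting. A minimal sketch of such a rule, assuming a hypothetical job label `my-app` (the alert name, job label, and runbook URL are illustrative, not part of UDS Core):

    ```yaml
    # Hypothetical rule: fires when Prometheus has no `up` series
    # for the my-app job, i.e. all scrape targets have vanished.
    - alert: MyAppTargetsMissing
      expr: absent(up{job="my-app"})
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "No my-app scrape targets are reporting"
        description: "Prometheus has no up series for job my-app"
        runbook: "https://example.com/runbooks/my-app-targets-missing"
    ```

    Note that a plain threshold expression returns nothing when the underlying metric is missing, so `absent()` is the standard way to catch that case.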
  2. Deploy the rule

    (Recommended) Include the PrometheusRule manifest in your Zarf package, then create and deploy the package. See Packaging applications for general packaging guidance.
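    One way to package the rule is as a plain manifest in a Zarf component. A sketch of the relevant portion of a zarf.yaml, assuming a package and namespace named `my-app` (package, component, and manifest names here are illustrative):

    ```yaml
    # zarf.yaml (sketch): ships my-app-alerts.yaml as a Kubernetes manifest
    kind: ZarfPackageConfig
    metadata:
      name: my-app
    components:
      - name: my-app-alerts
        required: true
        manifests:
          - name: alerts
            namespace: my-app
            files:
              - my-app-alerts.yaml
    ```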

    Terminal window
    uds zarf package create --confirm
    uds zarf package deploy zarf-package-*.tar.zst --confirm

    Or apply the PrometheusRule directly for quick testing:

    Terminal window
    uds zarf tools kubectl apply -f my-app-alerts.yaml

    The Prometheus Operator picks up PrometheusRule CRs automatically.

  3. (Optional) Disable or tune default alert rules

    If default alerts are too noisy or not relevant to your environment, you can tune both upstream kube-prometheus-stack and UDS Core defaults through bundle overrides.

    UDS Core default probe alerts can be tuned or disabled as follows:

    uds-bundle.yaml
    overrides:
      kube-prometheus-stack:
        uds-prometheus-config:
          values:
            # Disable all UDS Core probe default alerts
            - path: udsCoreDefaultAlerts.enabled
              value: false
            # Disable the Endpoint Down alert
            - path: udsCoreDefaultAlerts.probeEndpointDown.enabled
              value: false
            # Tune threshold and severity for TLS expiry warning alerts
            - path: udsCoreDefaultAlerts.probeTLSExpiryWarning.days
              value: 21
            - path: udsCoreDefaultAlerts.probeTLSExpiryWarning.severity
              value: warning
            # Tune threshold and severity for TLS expiry critical alerts
            - path: udsCoreDefaultAlerts.probeTLSExpiryCritical.days
              value: 7
            - path: udsCoreDefaultAlerts.probeTLSExpiryCritical.severity
              value: critical

    Upstream kube-prometheus-stack default rules can be disabled as follows:

    uds-bundle.yaml
    overrides:
      kube-prometheus-stack:
        kube-prometheus-stack:
          values:
            # Disable specific individual rules by name
            - path: defaultRules.disabled
              value:
                KubeControllerManagerDown: true
                KubeSchedulerDown: true
            # Disable entire rule groups with boolean toggles
            - path: defaultRules.rules.kubeControllerManager
              value: false
            - path: defaultRules.rules.kubeSchedulerAlerting
              value: false

    Use defaultRules.disabled for fine-tuned control over upstream individual rules. Use defaultRules.rules.* to disable upstream rule groups when broader changes are needed.

    Create and deploy your bundle:

    Terminal window
    uds create <path-to-bundle-dir>
    uds deploy uds-bundle-<name>-<arch>-<version>.tar.zst

Open Grafana and navigate to Alerting > Alert rules. Filter by the Prometheus datasource. Confirm your custom rules appear in the list.

Check the rule state to understand its current status:

  • Inactive: condition is not met
  • Pending: condition is met but the for duration has not elapsed
  • Firing: active alert being sent to Alertmanager
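You can also inspect alert state directly in Prometheus, which exports each active alert as a synthetic ALERTS series whose alertstate label is pending or firing (an Inactive alert produces no series). For example, using the alert name from the earlier example:

```promql
# Current state of the custom alert; an empty result means Inactive
ALERTS{alertname="PodRestartingFrequently"}
```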

Symptom: Custom alert rules do not show up in the Grafana alerting UI.

Solution: Verify the PrometheusRule CR was created successfully and check for YAML syntax errors:

Terminal window
uds zarf tools kubectl get prometheusrule -A
uds zarf tools kubectl describe prometheusrule <name> -n <namespace>

Symptom: The PromQL expression should match, but the alert stays in Inactive state.

Solution: Verify the PromQL expression returns results in the Prometheus UI:

Terminal window
uds zarf connect prometheus

Navigate to the Graph tab and run your expr query directly. If it returns results, check that the for duration has elapsed, because the alert will remain in Pending state until the condition is continuously true for that period.
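When an expression returns nothing, it can help to strip it down and rebuild it piece by piece. Run each of the following queries separately in the Graph tab (metric and namespace names here match the earlier example and are illustrative):

```promql
# 1. Confirm the raw series exists for your namespace
kube_pod_container_status_restarts_total{namespace="my-app"}

# 2. Apply the range function without the comparison
increase(kube_pod_container_status_restarts_total{namespace="my-app"}[1h])

# 3. Re-add the threshold; an empty result here means the alert stays Inactive
increase(kube_pod_container_status_restarts_total{namespace="my-app"}[1h]) > 5
```

If step 1 returns nothing, the metric is not being scraped at all and the problem is with the scrape target, not the alert rule.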