Monitoring & Observability

UDS Core’s monitoring stack exposes configuration surfaces at two levels: built-in platform monitoring that works out of the box, and application-level uptime probes that operators configure through the Package CR.

UDS Core adds two uptime-focused dashboards to Grafana alongside its component dashboards:

| Dashboard | Description |
|---|---|
| UDS / Monitoring / Core Uptime | Availability status, uptime percentage, and component status timeline for UDS Core infrastructure components |
| UDS / Monitoring / Probe Uptime | Probe uptime status timeline, percentage uptime, and TLS certificate expiration dates for all monitored endpoints |

UDS Core includes endpoint probes for core services out of the box. These create Prometheus Probes automatically.

| Service | Gateway | Monitored paths | Probe name |
|---|---|---|---|
| Keycloak (SSO) | tenant | /, /realms/uds/.well-known/openid-configuration | uds-sso-tenant-uptime |
| Keycloak (admin) | admin | / | uds-keycloak-admin-uptime |
| Grafana | admin | /healthz | uds-grafana-admin-uptime |

Each service has an uptime.enabled Helm value (boolean, default: true) that controls whether its default probes are created.

To disable probes for Keycloak and Grafana, add a value override in your bundle:

uds-bundle.yaml
overrides:
  keycloak:
    keycloak:
      values:
        - path: uptime.enabled
          value: false
  grafana:
    uds-grafana-config:
      values:
        - path: uptime.enabled
          value: false

UDS Core ships Prometheus recording rules that track the availability of core infrastructure components. These produce uds:<component>:up metrics (1 = available, 0 = unavailable) and require no user configuration. Rules are organized by layer:

  • base: Istiod, Istio CNI, ztunnel, admin and tenant ingress gateways, Pepr admission and watcher
  • monitoring: Prometheus, Alertmanager, Blackbox Exporter, Kube State Metrics, Prometheus Operator, Node Exporter, Grafana, Grafana endpoint (probe-derived)
  • logging: Loki (backend, write, read, and gateway), Vector
  • identity-authorization: Keycloak, Keycloak Waypoint, Authservice, Keycloak SSO endpoint (probe-derived), Keycloak admin endpoint (probe-derived)
  • runtime-security: Falco, Falcosidekick
  • backup-restore: Velero
  • core: uds:access:up, the overall access health indicator derived from uds:keycloak_endpoint:up (probe-derived)
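To illustrate how these recording-rule metrics could be consumed, here is a minimal sketch of a PrometheusRule that alerts when uds:access:up reports 0. The resource name, namespace, alert name, duration, and severity are assumptions for illustration and are not shipped by UDS Core. Whether Prometheus discovers the resource depends on the rule selectors configured in your cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-access-health      # hypothetical name
  namespace: monitoring            # assumption: adjust to your cluster
spec:
  groups:
    - name: example-access-health
      rules:
        # Fires when the overall access indicator has been 0 for 5 minutes
        - alert: ExampleAccessDown   # hypothetical alert name
          expr: uds:access:up == 0
          for: 5m                    # assumption: illustrative duration
          labels:
            severity: critical       # assumption: illustrative severity
          annotations:
            summary: "uds:access:up has reported 0 for 5 minutes"
```

The same pattern should apply to any other uds:<component>:up series, since each one reports 1 or 0.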

All endpoint probes (both built-in and application) produce standard Blackbox Exporter metrics:

| Metric | Description |
|---|---|
| probe_success | Whether the probe succeeded (1) or failed (0) |
| probe_duration_seconds | Total probe duration |
| probe_http_status_code | HTTP response status code |
| probe_ssl_earliest_cert_expiry | SSL certificate expiration timestamp |
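As a rough sketch of how these metrics can be combined, the following PrometheusRule defines two recording rules: a 24-hour uptime percentage per probe target and the number of days until the earliest certificate expiry. The resource name, namespace, and recording-rule names are hypothetical, and these rules are not part of UDS Core.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-probe-recordings   # hypothetical name
  namespace: monitoring            # assumption: adjust to your cluster
spec:
  groups:
    - name: example-probe-recordings
      rules:
        # Share of successful probes over the last 24 hours, as a percentage
        - record: example:probe_uptime:percent_24h
          expr: avg_over_time(probe_success[24h]) * 100
        # Days remaining until the earliest certificate in the chain expires
        - record: example:probe_tls_expiry:days
          expr: (probe_ssl_earliest_cert_expiry - time()) / 86400
```

Because probe_success is 1 or 0 per scrape, its average over a window is the fraction of successful probes in that window.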

UDS Core ships opinionated probe alert rules in the uds-prometheus-config chart. These rules cover endpoint downtime and TLS certificate expiry for any series emitted by Blackbox Exporter probes, including built-in Core probes and application probes you configure through the Package CR.

The following rules are enabled by default:

| Rule | Default for | Default threshold | Default severity | Description |
|---|---|---|---|---|
| UDSProbeEndpointDown | 5m | probe_success == 0 | warning | Fires when a probe reports endpoint failure for longer than the configured duration |
| UDSProbeTLSExpiryWarning | 10m | certificate expires in less than 30 days | warning | Fires when a healthy probe reports a TLS certificate nearing expiry |
| UDSProbeTLSExpiryCritical | 10m | certificate expires in less than 14 days | critical | Fires when a healthy probe reports a TLS certificate nearing critical expiry |
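To make the defaults concrete, the rules above can be approximated in Prometheus rule-file form as follows. This is a sketch derived from the table, not the expressions shipped in the uds-prometheus-config chart. The group name and Example-prefixed alert names are hypothetical, and the "healthy probe" condition is assumed to mean probe_success == 1.

```yaml
groups:
  - name: example-probe-alerts        # hypothetical group name
    rules:
      # Endpoint has failed its probe continuously for 5 minutes
      - alert: ExampleProbeEndpointDown
        expr: probe_success == 0
        for: 5m
        labels:
          severity: warning
      # Probe is healthy and its certificate expires in under 30 days
      - alert: ExampleProbeTLSExpiryWarning
        expr: ((probe_ssl_earliest_cert_expiry - time()) < 30 * 86400) and (probe_success == 1)
        for: 10m
        labels:
          severity: warning
      # Probe is healthy and its certificate expires in under 14 days
      - alert: ExampleProbeTLSExpiryCritical
        expr: ((probe_ssl_earliest_cert_expiry - time()) < 14 * 86400) and (probe_success == 1)
        for: 10m
        labels:
          severity: critical
```

Restricting the TLS rules to healthy probes means a failing endpoint produces only the endpoint-down alert rather than additional certificate alerts.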

All three rules preserve probe labels from the source series, such as instance and job. UDS Core also adds the following labels to support routing and filtering:

| Label | Value | Description |
|---|---|---|
| severity | value-specific | Alertmanager routing severity set by the matching udsCoreDefaultAlerts.*.severity field |
| source | blackbox | Identifies the alert as originating from Blackbox Exporter probe data |
| category | probe | Identifies the alert as a probe-focused alert rule |
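As an example of routing on these labels, the following Alertmanager configuration fragment sends critical probe alerts to one receiver and all other probe alerts to another. The receiver names are hypothetical and have no notification integrations configured. How Alertmanager configuration is supplied in a UDS Core deployment is not covered here, so treat this as a sketch of the matcher logic only.

```yaml
route:
  receiver: default                  # hypothetical fallback receiver
  routes:
    # Critical probe alerts (for example, UDSProbeTLSExpiryCritical by default)
    - matchers:
        - source = "blackbox"
        - category = "probe"
        - severity = "critical"
      receiver: probe-oncall         # hypothetical receiver
    # All remaining probe alerts
    - matchers:
        - source = "blackbox"
        - category = "probe"
      receiver: probe-notifications  # hypothetical receiver
receivers:
  - name: default
  - name: probe-oncall
  - name: probe-notifications
```

Routes are evaluated in order, so the more specific critical route is listed first.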

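Use the following Helm values to tune or disable the built-in probe alert rules. Field names are relative to the udsCoreDefaultAlerts key (for example, .enabled is udsCoreDefaultAlerts.enabled):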

| Field | Type | Default | Description |
|---|---|---|---|
| .enabled | boolean | true | Enables or disables the full UDS Core default probe alert ruleset |
| .probeEndpointDown.enabled | boolean | true | Enables or disables the UDSProbeEndpointDown rule |
| .probeEndpointDown.for | string | 5m | Sets how long probe_success == 0 must remain true before UDSProbeEndpointDown fires |
| .probeEndpointDown.severity | string | warning | Sets the severity label for UDSProbeEndpointDown |
| .probeTLSExpiryWarning.enabled | boolean | true | Enables or disables the UDSProbeTLSExpiryWarning rule |
| .probeTLSExpiryWarning.for | string | 10m | Sets how long the TLS warning condition must remain true before UDSProbeTLSExpiryWarning fires |
| .probeTLSExpiryWarning.days | integer | 30 | Sets the warning threshold, in days before certificate expiry |
| .probeTLSExpiryWarning.severity | string | warning | Sets the severity label for UDSProbeTLSExpiryWarning |
| .probeTLSExpiryCritical.enabled | boolean | true | Enables or disables the UDSProbeTLSExpiryCritical rule |
| .probeTLSExpiryCritical.for | string | 10m | Sets how long the TLS critical condition must remain true before UDSProbeTLSExpiryCritical fires |
| .probeTLSExpiryCritical.days | integer | 14 | Sets the critical threshold, in days before certificate expiry |
| .probeTLSExpiryCritical.severity | string | critical | Sets the severity label for UDSProbeTLSExpiryCritical |

The following snippet shows several examples of how the default probe alert settings can be modified:

uds-bundle.yaml
overrides:
  kube-prometheus-stack:
    uds-prometheus-config:
      values:
        # Disable all UDS Core default probe alerts
        - path: udsCoreDefaultAlerts.enabled
          value: false
        # Disable only the endpoint-down alert
        - path: udsCoreDefaultAlerts.probeEndpointDown.enabled
          value: false
        # Adjust TLS warning threshold and severity
        - path: udsCoreDefaultAlerts.probeTLSExpiryWarning.days
          value: 21
        - path: udsCoreDefaultAlerts.probeTLSExpiryWarning.severity
          value: warning
        # Adjust TLS critical threshold and severity
        - path: udsCoreDefaultAlerts.probeTLSExpiryCritical.days
          value: 7
        - path: udsCoreDefaultAlerts.probeTLSExpiryCritical.severity
          value: critical

Applications configure uptime monitoring through the uptime block on expose entries in the Package CR. The UDS Operator creates Prometheus Probe resources and configures Blackbox Exporter automatically. For step-by-step setup, see Set up uptime monitoring.
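For orientation, a Package CR with an uptime block on an expose entry might look roughly like the sketch below. The application name, namespace, selector, host, port, and path are hypothetical. The field names under uptime (checks.paths) are an assumption and should be verified against the Package CR reference and the linked setup guide before use.

```yaml
apiVersion: uds.dev/v1alpha1
kind: Package
metadata:
  name: my-app            # hypothetical application
  namespace: my-app       # hypothetical namespace
spec:
  network:
    expose:
      - service: my-app   # hypothetical Service name
        selector:
          app: my-app
        host: my-app      # hypothetical host on the tenant gateway
        gateway: tenant
        port: 8080
        # Assumption: field names under uptime are illustrative; confirm
        # them against the Package CR reference.
        uptime:
          checks:
            paths:
              - /healthz  # hypothetical health endpoint
```

Probes created this way emit the same Blackbox Exporter metrics as the built-in probes, so the default probe alert rules also apply to them.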