Monitoring & Observability
UDS Core’s monitoring stack exposes configuration surfaces at two levels: built-in platform monitoring that works out of the box, and application-level uptime probes that operators configure through the Package CR.
Built-in monitoring
Section titled “Built-in monitoring”Grafana dashboards
Section titled “Grafana dashboards”UDS Core adds two uptime-focused dashboards to Grafana alongside its component dashboards:
| Dashboard | Description |
|---|---|
| UDS / Monitoring / Core Uptime | Availability status, uptime percentage, and component status timeline for UDS Core infrastructure components |
| UDS / Monitoring / Probe Uptime | Probe uptime status timeline, percentage uptime, and TLS certificate expiration dates for all monitored endpoints |
Default uptime probes
Section titled “Default uptime probes”UDS Core includes endpoint probes for core services out of the box. These create Prometheus Probes automatically.
| Service | Gateway | Monitored paths | Probe name |
|---|---|---|---|
| Keycloak (SSO) | tenant | /, /realms/uds/.well-known/openid-configuration | uds-sso-tenant-uptime |
| Keycloak (admin) | admin | / | uds-keycloak-admin-uptime |
| Grafana | admin | /healthz | uds-grafana-admin-uptime |
Disabling default probes
Section titled “Disabling default probes”Each service has an uptime.enabled Helm value (boolean, default: true) that controls whether its default probes are created.
To disable probes for Keycloak and Grafana, add a value override in your bundle:
overrides: keycloak: keycloak: values: - path: uptime.enabled value: false grafana: uds-grafana-config: values: - path: uptime.enabled value: falseRecording rules
Section titled “Recording rules”UDS Core ships Prometheus recording rules that track the availability of core infrastructure components. These produce uds:<component>:up metrics (1 = available, 0 = unavailable) and require no user configuration. Rules are organized by layer:
- base: Istiod, Istio CNI, ztunnel, admin and tenant ingress gateways, Pepr admission and watcher
- monitoring: Prometheus, Alertmanager, Blackbox Exporter, Kube State Metrics, Prometheus Operator, Node Exporter, Grafana, Grafana endpoint (probe-derived)
- logging: Loki backend, write, read, and gateway, Vector
- identity-authorization: Keycloak, Keycloak Waypoint, Authservice, Keycloak SSO endpoint (probe-derived), Keycloak admin endpoint (probe-derived)
- runtime-security: Falco, Falcosidekick
- backup-restore: Velero
- core:
uds:access:up, the overall access health indicator derived fromuds:keycloak_endpoint:up(probe-derived)
Probe metrics
Section titled “Probe metrics”All endpoint probes (both built-in and application) produce standard Blackbox Exporter metrics:
| Metric | Description |
|---|---|
probe_success | Whether the probe succeeded (1) or failed (0) |
probe_duration_seconds | Total probe duration |
probe_http_status_code | HTTP response status code |
probe_ssl_earliest_cert_expiry | SSL certificate expiration timestamp |
Default probe alert rules
Section titled “Default probe alert rules”UDS Core ships opinionated probe alert rules in the uds-prometheus-config chart. These rules cover endpoint downtime and TLS certificate expiry for any series emitted by Blackbox Exporter probes, including built-in Core probes and application probes you configure through the Package CR.
Shipped rules
Section titled “Shipped rules”The following rules are enabled by default:
| Rule | Default for | Default threshold | Default severity | Description |
|---|---|---|---|---|
UDSProbeEndpointDown | 5m | probe_success == 0 | warning | Fires when a probe reports endpoint failure for longer than the configured duration |
UDSProbeTLSExpiryWarning | 10m | certificate expires in less than 30 days | warning | Fires when a healthy probe reports a TLS certificate nearing expiry |
UDSProbeTLSExpiryCritical | 10m | certificate expires in less than 14 days | critical | Fires when a healthy probe reports a TLS certificate nearing critical expiry |
All three rules preserve probe labels from the source series, such as instance and job. UDS Core also adds the following labels to support routing and filtering:
| Label | Value | Description |
|---|---|---|
severity | value-specific | Alertmanager routing severity set by the matching udsCoreDefaultAlerts.*.severity field |
source | blackbox | Identifies the alert as originating from Blackbox Exporter probe data |
category | probe | Identifies the alert as a probe-focused alert rule |
Configuration surface
Section titled “Configuration surface”Use the following Helm values to tune or disable the built-in probe alert rules:
| Field | Type | Default | Description |
|---|---|---|---|
.enabled | boolean | true | Enables or disables the full UDS Core default probe alert ruleset |
.probeEndpointDown.enabled | boolean | true | Enables or disables the UDSProbeEndpointDown rule |
.probeEndpointDown.for | string | 5m | Sets how long probe_success == 0 must remain true before UDSProbeEndpointDown fires |
.probeEndpointDown.severity | string | warning | Sets the severity label for UDSProbeEndpointDown |
.probeTLSExpiryWarning.enabled | boolean | true | Enables or disables the UDSProbeTLSExpiryWarning rule |
.probeTLSExpiryWarning.for | string | 10m | Sets how long the TLS warning condition must remain true before UDSProbeTLSExpiryWarning fires |
.probeTLSExpiryWarning.days | integer | 30 | Sets the warning threshold, in days before certificate expiry |
.probeTLSExpiryWarning.severity | string | warning | Sets the severity label for UDSProbeTLSExpiryWarning |
.probeTLSExpiryCritical.enabled | boolean | true | Enables or disables the UDSProbeTLSExpiryCritical rule |
.probeTLSExpiryCritical.for | string | 10m | Sets how long the TLS critical condition must remain true before UDSProbeTLSExpiryCritical fires |
.probeTLSExpiryCritical.days | integer | 14 | Sets the critical threshold, in days before certificate expiry |
.probeTLSExpiryCritical.severity | string | critical | Sets the severity label for UDSProbeTLSExpiryCritical |
The following snippet shows several examples of how the default probe alert settings can be modified:
overrides: kube-prometheus-stack: uds-prometheus-config: values: # Disable all UDS Core default probe alerts - path: udsCoreDefaultAlerts.enabled value: false # Disable only the endpoint-down alert - path: udsCoreDefaultAlerts.probeEndpointDown.enabled value: false # Adjust TLS warning threshold and severity - path: udsCoreDefaultAlerts.probeTLSExpiryWarning.days value: 21 - path: udsCoreDefaultAlerts.probeTLSExpiryWarning.severity value: warning # Adjust TLS critical threshold and severity - path: udsCoreDefaultAlerts.probeTLSExpiryCritical.days value: 7 - path: udsCoreDefaultAlerts.probeTLSExpiryCritical.severity value: criticalApplication uptime probes
Section titled “Application uptime probes”Applications configure uptime monitoring through the uptime block on expose entries in the Package CR. The UDS Operator creates Prometheus Probe resources and configures Blackbox Exporter automatically. For step-by-step setup, see Set up uptime monitoring.
Related documentation
Section titled “Related documentation”- Monitoring & Observability concepts: high-level overview of the monitoring stack
- Set up uptime monitoring: configure application uptime probes
- Capture application metrics: expose metrics from your application for Prometheus scraping
- Create metric alerting rules: define custom
PrometheusRulealerts and tune built-in defaults - Create log-based alerting and recording rules: configure Loki Ruler alerts and recording rules
- Route alerts to notification channels: configure Alertmanager receivers and routing
- Add custom dashboards: deploy Grafana dashboards alongside your application
- Add Grafana datasources: connect additional data sources to Grafana