TazLab K8s: Monitoring Detail

Level 3 (Detail) — Prometheus, Grafana, dashboards, metrics-server.

Concept

The monitoring stack uses kube-prometheus-stack (Prometheus + Grafana + Alertmanager), managed as a Flux HelmRelease. Grafana uses the shared PostgreSQL instance (tazlab-db) as its backend, so dashboards persist across pod restarts. Dashboards are managed as ConfigMaps that a Grafana sidecar loads.

HelmRelease

File: infrastructure/operators/monitoring/helmrelease.yaml

Field       Value
Chart       kube-prometheus-stack
Repository  prometheus-community
Namespace   monitoring
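
For orientation, a minimal sketch of what a HelmRelease with these fields could look like. Only the chart, repository name, and namespace come from the table above; the API version, release name, and interval are assumptions and may not match helmrelease.yaml.

```yaml
# Hypothetical sketch: only chart, repository name, and namespace are taken from this page.
apiVersion: helm.toolkit.fluxcd.io/v2   # assumption: older Flux releases use v2beta1 or v2beta2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack           # assumed release name
  namespace: monitoring
spec:
  interval: 30m                         # assumed reconcile interval
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community      # looked up in the HelmRelease namespace unless sourceRef.namespace is set
```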

Manifests

Directory: infrastructure/operators/monitoring/

File                            Purpose
helmrepository.yaml             Helm repository reference
helmrelease.yaml                kube-prometheus-stack installation
namespace.yaml                  monitoring namespace
metrics-server.yaml             Resource metrics (CPU/memory)
flux-secret-sync.yaml           Grafana admin credentials sync
grafana-ingress.yaml            Grafana HTTPS ingress
dashboards/cluster-health.yaml  Cluster health dashboard ConfigMap
dashboards/nodes-pro.yaml       Detailed node metrics dashboard

Grafana PostgreSQL Backend

Grafana is configured to use tazlab-db as its database backend (database grafana, user grafana). This ensures:

  • Dashboard configurations survive pod restarts
  • User sessions and preferences are persistent
  • No PVC dependency for Grafana itself

The grafana-bootstrap-secret Secret (referenced by the infrastructure-instances Kustomization’s substituteFrom) provides the initial admin credentials.
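
The backend described above might be expressed as a values fragment along these lines. This is a sketch, not the contents of helmrelease.yaml: the database name and user come from this page, while the host address, the credentials Secret, and its key are placeholders.

```yaml
# Hypothetical values fragment for the kube-prometheus-stack HelmRelease (under spec.values).
grafana:
  persistence:
    enabled: false                       # no PVC; Grafana state lives in PostgreSQL
  grafana.ini:
    database:
      type: postgres
      host: tazlab-db.example.svc:5432   # placeholder; the real service address is not given here
      name: grafana
      user: grafana
  envValueFrom:
    GF_DATABASE_PASSWORD:                # Grafana reads this env var as the [database] password
      secretKeyRef:
        name: grafana-db-credentials     # hypothetical Secret name
        key: password                    # hypothetical key
```

Passing the password through an environment variable keeps it out of the rendered grafana.ini and out of Git.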

Dashboards as Code

Dashboards are stored as ConfigMaps with label grafana_dashboard: "1". Grafana’s sidecar watches for this label and loads dashboards automatically.
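
A minimal sketch of a dashboard ConfigMap carrying that label. The ConfigMap name, the data key, and the dashboard JSON are illustrative, and placing it in the monitoring namespace assumes the sidecar only watches its own namespace.

```yaml
# Hypothetical example; not one of the dashboards in this repository.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-example          # hypothetical name
  namespace: monitoring            # assumption: sidecar watches this namespace
  labels:
    grafana_dashboard: "1"         # the label the sidecar selects on
data:
  example.json: |
    {
      "title": "Example",
      "uid": "example",
      "panels": []
    }
```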

Current dashboards:

  • cluster-health.yaml — High-level node and pod metrics
  • nodes-pro.yaml — Detailed hardware and kernel metrics

metrics-server

File: infrastructure/operators/monitoring/metrics-server.yaml

Provides resource metrics used by kubectl top and HorizontalPodAutoscaler. Installed alongside kube-prometheus-stack.
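
As an illustration of a consumer of these metrics, the sketch below shows a HorizontalPodAutoscaler that scales on CPU utilization. The target Deployment, namespace, and thresholds are hypothetical and do not describe a workload in this cluster.

```yaml
# Hypothetical HPA; it relies on the resource metrics API served by metrics-server.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app                # hypothetical
  namespace: default               # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app              # hypothetical Deployment
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the pods' CPU requests
```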

flux-secret-sync

Syncs Grafana admin credentials from an ExternalSecret to the Grafana deployment, so that the initial admin password is available before Grafana starts.

DAG Position

operators-namespaces (Level 0, creates monitoring namespace)
→ monitoring (Level 1, installs kube-prometheus-stack and metrics-server)
→ configs (Level 2, creates S3 backup ExternalSecret for tazlab-db)
→ instances (Level 3, creates Grafana ingress + syncs grafana-bootstrap-secret)
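
In Flux, an ordering like the one above is usually expressed with dependsOn on each Kustomization. The sketch below shows how the last level might declare it. The Kustomization name and the substituteFrom Secret come from this page; the namespace, path, interval, source name, and the name of the upstream Kustomization are assumptions.

```yaml
# Hypothetical sketch of the Level 3 Kustomization.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure-instances
  namespace: flux-system              # assumption
spec:
  interval: 10m                       # assumption
  path: ./infrastructure/instances    # assumption
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system                 # assumption
  dependsOn:
    - name: infrastructure-configs    # assumed name of the "configs" level
  postBuild:
    substituteFrom:
      - kind: Secret
        name: grafana-bootstrap-secret
```

Flux waits for every Kustomization listed under dependsOn to be ready before applying this one, which is what enforces the level ordering.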

Prometheus Status Note

As of the 2026-04-29 power loss recovery, Prometheus runs on a manually salvaged Longhorn volume. The volume is healthy and the pod reports 2/2 containers Running. Future consideration: if monitoring data becomes important enough to protect against a single-node failure, the Longhorn volume behind the Prometheus PVC should use 2 replicas.
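
One way the 2-replica option could be provided is a dedicated Longhorn StorageClass, sketched below. The class name and reclaim policy are assumptions, and nothing like this is confirmed to exist in the repository.

```yaml
# Hypothetical StorageClass for a replicated Prometheus volume.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2-replicas        # hypothetical name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain              # assumption: keep the volume if the PVC is deleted
parameters:
  numberOfReplicas: "2"            # Longhorn keeps two copies of each volume
```

The Prometheus claim would then reference the class through prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName in the chart values. StorageClass parameters only apply to newly provisioned volumes, so the existing salvaged volume would need its replica count changed in Longhorn itself.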

See Also