Ephemeral Castle Rebirth Protocol

The Rebirth Protocol (create.sh v11.1) is a high-level orchestrator that takes the cluster from “Nothing” to “Blog Online” in approximately 9-10 minutes (down from 12 minutes, following the v11.1 hardening).

The create.sh Logic (Step-by-Step)

1. Secret Resolution

Uses the resolve() bash function to extract credentials from the TazPod Vault (/home/tazpod/secrets/).

  • If a file exists in the vault, it’s exported as an env var.
  • GITHUB_TOKEN is used to create a cluster-wide ghcr-pull-secret for authenticated ghcr.io pulls (avoids anonymous rate-limit).
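The lookup described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual resolve() from create.sh: the two-argument signature and the VAULT_DIR override variable are assumptions made for illustration.

```shell
# Hypothetical sketch of resolve(); the real function in create.sh may differ.
# VAULT_DIR defaults to the vault path named above and can be overridden.
VAULT_DIR="${VAULT_DIR:-/home/tazpod/secrets}"

# resolve VAR_NAME FILE_NAME
# Exports VAR_NAME with the contents of $VAULT_DIR/FILE_NAME if the file exists.
# Returns 1 and leaves the environment untouched when the file is missing.
resolve() {
  local var_name="$1"
  local file_name="$2"
  local path="${VAULT_DIR}/${file_name}"
  local value
  if [ -f "$path" ]; then
    # Command substitution strips the trailing newline, as a token requires.
    value=$(cat "$path")
    export "${var_name}=${value}"
    return 0
  fi
  return 1
}
```

A call such as `resolve GITHUB_TOKEN github-token` would then make the token available to later steps; callers can test the return code to decide whether a missing secret is fatal.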

2. Tailscale AuthKey Minting

The script executes mint_tailscale_authkey():

  • Authenticates with Tailscale OAuth (client_id + client_secret).
  • Requests a short-lived (3600s), reusable, ephemeral key with tag:tazlab-k8s.
  • The key is kept in memory and never committed to Terraform state.
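A minimal sketch of what such a minting function might look like, assuming curl and jq are available. The environment variable names TS_OAUTH_CLIENT_ID and TS_OAUTH_CLIENT_SECRET and the helper authkey_request_body are hypothetical; the endpoints are Tailscale's public OAuth token and auth-key creation endpoints, and the real mint_tailscale_authkey() may be structured differently.

```shell
# Hypothetical sketch; not the actual mint_tailscale_authkey() from create.sh.

# authkey_request_body TAG EXPIRY_SECONDS
# Prints the JSON body for a reusable, ephemeral, tagged auth key.
authkey_request_body() {
  local tag="$1"
  local expiry="$2"
  printf '{"capabilities":{"devices":{"create":{"reusable":true,"ephemeral":true,"tags":["%s"]}}},"expirySeconds":%s}\n' "$tag" "$expiry"
}

mint_tailscale_authkey() {
  local token
  # Exchange the OAuth client credentials for a short-lived API token.
  # TS_OAUTH_CLIENT_ID / TS_OAUTH_CLIENT_SECRET are assumed variable names.
  token=$(curl -fsS \
    -d "client_id=${TS_OAUTH_CLIENT_ID}" \
    -d "client_secret=${TS_OAUTH_CLIENT_SECRET}" \
    https://api.tailscale.com/api/v2/oauth/token | jq -r '.access_token')
  if [ -z "$token" ] || [ "$token" = "null" ]; then
    return 1
  fi
  # Create the key; "-" selects the tailnet that owns the token.
  curl -fsS \
    -H "Authorization: Bearer ${token}" \
    -H 'Content-Type: application/json' \
    -d "$(authkey_request_body tag:tazlab-k8s 3600)" \
    https://api.tailscale.com/api/v2/tailnet/-/keys | jq -r '.key'
}
```

Capturing the result with `TS_AUTHKEY=$(mint_tailscale_authkey)` keeps the key in a shell variable only, which matches the intent of never writing it to disk or to Terraform state.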

3. Platform Layer (proxmox-talos module)

  • Creates Proxmox VMs (1 CP + 1 Worker) via Terraform.
  • Applies Talos machine configuration with baked-in patches:
    • cluster.apiServer.certSANs: adds lushycorp-k8s.magellanic-gondola.ts.net to K8s API server TLS cert.
    • cluster.apiServer.extraArgs: configures multi-issuer service-account-issuer, api-audiences, service-account-jwks-uri.
    • cluster.coreDNS.disabled: true: disables Talos built-in CoreDNS.
    • machine.install.image: Talos v1.12.0 installer image.
  • CRITICAL LESSON: machine.certSANs is for the Talos node API (port 50000), NOT the K8s API server (port 6443). Using machine.certSANs triggers trustd PKI regeneration that can invalidate kubelet/kube-proxy certificates, causing a cluster deadlock requiring full rebuild.
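To make the distinction concrete, the helper below prints a patch of roughly the shape described above. It is an illustrative sketch only: the function name talos_patch is hypothetical, the extraArgs values are deliberately left out because they are cluster-specific, and the real patches live inside the proxmox-talos Terraform module.

```shell
# Hypothetical helper: prints a Talos config patch as YAML on stdout.
# talos_patch API_SAN INSTALLER_IMAGE
talos_patch() {
  local api_san="$1"
  local installer_image="$2"
  cat <<EOF
machine:
  install:
    image: ${installer_image}
  # No certSANs here: machine.certSANs targets the Talos node API (port 50000).
cluster:
  apiServer:
    # Extra SAN for the Kubernetes API server certificate (port 6443).
    certSANs:
      - ${api_san}
    # extraArgs (service-account-issuer, api-audiences,
    # service-account-jwks-uri) omitted: values are cluster-specific.
  coreDNS:
    disabled: true
EOF
}
```

For example, `talos_patch lushycorp-k8s.magellanic-gondola.ts.net "$INSTALLER_IMAGE" > patch.yaml`, where INSTALLER_IMAGE is assumed to hold the Talos v1.12.0 installer image reference.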

4. Engine Layer (k8s-engine module)

Layers applied AFTER Talos machines are configured:

  • ESO HelmRelease: external-secrets operator (required for CRDs).
  • CoreDNS (user-managed): ServiceAccount, ClusterRole, ClusterRoleBinding, ConfigMap (with ts.net forward → Tailscale nameserver 10.96.0.101), Deployment (2 replicas), ClusterIP Service (10.96.0.10). Replaces Talos built-in CoreDNS disabled above.
  • Vault bootstrap secrets: vault-ca-cert and vault-eso-token read directly from ~/secrets/lushycorp-vault/ via Terraform file().
  • Vault ClusterSecretStore tazlab-secrets-vault: deployed with depends_on on ESO helm release (ensures CRDs are registered).
  • GitHub token: created as direct kubernetes_secret_v1 from ~/secrets/github-token — no ExternalSecret dependency.
  • Tailscale operator OAuth: created as direct kubernetes_secret_v1 from ~/secrets/tailscale-operator-* — no ExternalSecret dependency.
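As an illustration of the user-managed CoreDNS ConfigMap, the sketch below renders a Corefile with a ts.net forward and a multiline health block. Only those two details come from this document; the remaining directives follow the stock Kubernetes Corefile and are assumptions, as are the function name render_corefile and the choice of a separate ts.net server block.

```shell
# Hypothetical helper: prints a Corefile on stdout.
# render_corefile TAILSCALE_NAMESERVER_IP
render_corefile() {
  local ts_dns="$1"
  cat <<EOF
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
}
ts.net:53 {
    errors
    cache 30
    forward . ${ts_dns}
}
EOF
}
```

Calling `render_corefile 10.96.0.101` yields a Corefile in which ts.net names are sent to the Tailscale nameserver while everything else follows the default path.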

5. Networking + GitOps + Storage (Parallel)

  • networking layer: MetalLB (IP pool + L2Advertisement).
  • gitops layer: Flux bootstrap using pre-created GitHub token secret. Flux DAG then reconciles: bridge -> namespaces -> operators-tailscale (uses pre-created OAuth) -> tailscale-dns -> configs (Vault store) -> instances -> auth -> apps.
  • storage layer: Longhorn, which provides the volume backing for PVCs. It runs in parallel with networking and gitops and is non-blocking: Longhorn does not depend on MetalLB, so there is no technical reason to sequence them.
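One way to run independent layers side by side is plain shell job control, sketched below. The function names run_layers_parallel and apply_layer are hypothetical, and the assumption that each layer is a directory applied with terragrunt is for illustration; the actual create.sh may orchestrate the layers differently.

```shell
# Hypothetical: apply one layer, assuming each layer is a Terragrunt directory.
apply_layer() {
  (cd "$1" && terragrunt apply -auto-approve)
}

# run_layers_parallel LOG_DIR LAYER...
# Starts every layer in the background with its own log file, then waits
# for all of them. Returns non-zero if any layer failed.
run_layers_parallel() {
  local log_dir="$1"
  shift
  local pids=""
  local failed=0
  local layer pid
  mkdir -p "$log_dir"
  for layer in "$@"; do
    apply_layer "$layer" > "${log_dir}/layer-${layer}.log" 2>&1 &
    pids="$pids $!"
  done
  for pid in $pids; do
    # Wait on each PID individually so one failure is not masked by another.
    wait "$pid" || failed=1
  done
  return "$failed"
}
```

With this shape, `run_layers_parallel /workspace/logs/dag-fix networking gitops storage` would leave one layer-*.log per layer and fail the run if any of the three fails.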

7. Post-Bootstrap: PGO → Vault Sync

After Flux deploys PGO and the database is ready:

  1. Script waits for tazlab-db-pguser-grafana secret in tazlab-db namespace.
  2. Reads the fresh PGO-generated password.
  3. Syncs it to Vault at secret/tazlab-k8s/static/monitoring/grafana/GRAFANA_DB_PASSWORD.
  4. Grafana ExternalSecret picks up the password and Grafana starts with 3/3 containers.
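The sync steps above might be sketched as follows, assuming kubectl and the vault CLI are on the PATH and already authenticated. The function names, the retry defaults, the secret key name password, and the split of the Vault path into a KV path plus a GRAFANA_DB_PASSWORD field are all assumptions, not details confirmed by create.sh.

```shell
# wait_for_secret NAMESPACE NAME [TRIES] [DELAY_SECONDS]
# Polls until the secret exists; returns 1 after TRIES attempts.
wait_for_secret() {
  local ns="$1"
  local name="$2"
  local tries="${3:-60}"
  local delay="${4:-5}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    if kubectl -n "$ns" get secret "$name" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# read_secret_key NAMESPACE NAME KEY
# Prints the base64-decoded value of one key of a secret.
read_secret_key() {
  kubectl -n "$1" get secret "$2" -o "jsonpath={.data.$3}" | base64 -d
}

sync_grafana_password() {
  local pw
  wait_for_secret tazlab-db tazlab-db-pguser-grafana || return 1
  # "password" is the key assumed to hold the PGO-generated password.
  pw=$(read_secret_key tazlab-db tazlab-db-pguser-grafana password) || return 1
  # "KEY=-" makes vault read the value from stdin, keeping it off the
  # process command line.
  printf '%s' "$pw" |
    vault kv put secret/tazlab-k8s/static/monitoring/grafana GRAFANA_DB_PASSWORD=-
}
```

Note that `vault kv put` replaces all fields at the path; if other fields live alongside the password, a merge-style write would be the safer choice.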

8. Verification

  • Waits for Traefik LoadBalancer IP assignment.
  • Runs check-blog.sh to verify the blog is responding.
  • Reports timing for each layer (approximate: platform 2m, engine 95s, networking+gitops 106s, storage 85s).
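Per-layer timing of this kind can be collected with a small wrapper. The sketch below is an assumption about how such reporting could work; time_layer is a hypothetical name, not necessarily what create.sh uses.

```shell
# time_layer NAME CMD [ARGS...]
# Runs CMD, prints "NAME: <seconds>s", and returns CMD's exit status.
time_layer() {
  local name="$1"
  shift
  local start end
  local status=0
  start=$(date +%s)
  "$@" || status=$?
  end=$(date +%s)
  echo "${name}: $((end - start))s"
  return "$status"
}
```

For instance, `time_layer verification ./check-blog.sh` would report how long the blog check took while still propagating its exit code to the caller.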

Bootstrap Deadlocks Fixed (2026-05-28)

The engine layer was restructured to eliminate bootstrap circular dependencies:

Dependency | Problem | Fix
--- | --- | ---
Operator OAuth → Vault → DNS → Operator | Chicken-and-egg in fresh bootstrap | OAuth secret created directly from operator files in engine layer
CoreDNS → Talos SSA | Talos v1.12+ Server-Side Apply overwrites manual Corefile changes | coreDNS.disabled: true + user-managed CoreDNS deployed by engine
ESO CRD → Vault ClusterSecretStore | CRD not registered when Terraform applies store resource | depends_on = [helm_release.external_secrets]
GitHub token → ESO → Vault → DNS | ExternalSecret chain needs DNS which needs operator | Direct kubernetes_secret_v1 from operator file, bypassing ESO
CoreDNS Corefile syntax | Single-line health { lameduck 5s } causes CoreDNS crash loop | Multiline health block in Corefile
Grafana secret missing from kustomization | secrets.yaml not in resources list of monitoring kustomization | Added to kustomization.yaml and pushed to master

Log Locations

Logs are stored in clusters/tazlab-k8s/proxmox/logs/ with a timestamped directory structure. Parallel Terragrunt layers output to /workspace/logs/dag-fix/layer-*.log.

See Also