Ephemeral Castle Vault Bootstrap and Restore

The Vault runtime on Hetzner uses strict classification logic to handle the transition from a fresh VPS to a fully hydrated, unsealed secret engine.

The Classification Matrix

The Ansible role vault-runtime (tasks in ansible/roles/vault-runtime/tasks/) determines the state of three domains:

  1. TazPod / operator-side canonical bootstrap set
    • empty (T0): none of the canonical bootstrap artifacts are present under ~/secrets/lushycorp-vault/
    • coherent (T1): the canonical set is complete and classifiable
    • incoherent (T2): files are partial, missing, or not from one consistent lineage
  2. Local host/Vault state
    • empty (H0): no Raft data / no valid host-local Vault state
    • coherent (H1): local state is valid for the active lineage
    • incoherent (H2): partial or contradictory local state
  3. S3 remote durability
    • empty (S0): no usable lineage pointers or snapshots
    • coherent (S1): valid lineage pointers and snapshots exist
    • incoherent (S2): broken pointers, partial uploads, or mismatched metadata

Important Path Distinction

The runtime playbook classifies TazPod state from the decrypted canonical files in ~/secrets/lushycorp-vault/:

  • init.json
  • unseal-keys.json
  • root-token.txt
  • admin-token.txt
  • admin-token.json
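
The presence check behind T0/T1/T2 can be sketched as follows. This is a minimal illustration, not the role's actual Ansible tasks: the function name classify_tazpod is hypothetical, and the sketch only looks at file presence, whereas a real T1 verdict also requires the files to belong to one consistent lineage.

```python
from pathlib import Path

# Canonical file names, as listed above.
CANONICAL_FILES = (
    "init.json",
    "unseal-keys.json",
    "root-token.txt",
    "admin-token.txt",
    "admin-token.json",
)


def classify_tazpod(secrets_dir: Path) -> str:
    """Return "T0", "T1", or "T2" from file presence alone (hypothetical helper).

    Assumption: presence is only a necessary condition for T1. Checking that
    the files form one consistent lineage is out of scope for this sketch.
    """
    present = [name for name in CANONICAL_FILES if (secrets_dir / name).is_file()]
    if not present:
        # Empty: nothing is materialized. The encrypted archive is not consulted.
        return "T0"
    if len(present) == len(CANONICAL_FILES):
        # Complete set; a lineage consistency check would still be needed.
        return "T1"
    # Partial set.
    return "T2"
```

Called as classify_tazpod(Path.home() / "secrets" / "lushycorp-vault"), it never inspects vault.tar.aes, which mirrors the distinction described in this section.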

The encrypted operator archive /workspace/.tazpod/vault/vault.tar.aes is not the classification input by itself. On 2026-04-28 this distinction mattered operationally: the encrypted archive still existed, but the canonical files above were absent, so the playbook correctly classified TazPod as T0.

Follow-up verification on the same day closed the loop:

  • the current encrypted archive was timestamped 2026-04-07, earlier than the C1/C2 Vault runtime builds
  • a direct decrypted archive listing contained zero lushycorp-vault/* entries

So the current failure is not that unlock forgot to materialize files that were already inside the encrypted archive. The stronger conclusion is that the active encrypted archive itself never contained the canonical Vault bootstrap set.

Root-Cause Gap In The Runtime Flow

The current runtime logic already does the first half correctly:

  • Phase B writes the canonical artifacts into ~/secrets/lushycorp-vault/ on the operator side.

What is missing is the second half:

  • there is no follow-up tazpod save
  • and no follow-up tazpod push vault

So the operator-side RAM view was updated during the original C1/C2 build, but the encrypted archive /workspace/.tazpod/vault/vault.tar.aes was never refreshed and pushed back to S3. That is why a later container/session reset can leave the runtime in T0 even though the earlier build had once produced a coherent canonical bootstrap set.

The runtime flow has now been corrected to run those persistence steps explicitly from /workspace, and to assert that the operator-side RAM vault is actually mounted before doing so.

Two operator-facing caveats discovered during remediation

  • tazpod save currently prints a success-style line even when the RAM vault is not mounted and no archive was refreshed.
  • tazpod push vault is cwd-sensitive: it must be run from the canonical workspace root (/workspace), not from an arbitrary directory.

So the durable operational rule remains: successful Vault materialization in ~/secrets/lushycorp-vault/ is not enough by itself; the encrypted archive must be refreshed and pushed from the canonical workspace root.
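
A guarded version of the two persistence steps might look like the sketch below. It is an illustration under stated assumptions, not the corrected playbook itself: persist_vault, run_in, and the is_mounted callable are hypothetical names, how the RAM vault mount is detected is left to the caller because the source does not specify it, and comparing the archive's modification time is only a heuristic for "the archive was actually refreshed". The two commands and the paths are the ones named above.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

WORKSPACE = Path("/workspace")
ARCHIVE = WORKSPACE / ".tazpod/vault/vault.tar.aes"


def archive_mtime_ns(archive: Path) -> int:
    """Modification time of the archive in nanoseconds, or -1 if it is missing."""
    return archive.stat().st_mtime_ns if archive.exists() else -1


def run_in(cmd: Sequence[str], cwd: Path) -> None:
    """Run a command from an explicit working directory; raise on non-zero exit."""
    subprocess.run(list(cmd), cwd=cwd, check=True)


def persist_vault(
    is_mounted: Callable[[], bool],
    run: Callable[[Sequence[str], Path], None] = run_in,
    workspace: Path = WORKSPACE,
    archive: Path = ARCHIVE,
) -> None:
    """Save and push the operator vault, refusing to trust a bare success line."""
    # Assumption: the caller knows how to detect the mounted RAM vault.
    if not is_mounted():
        raise RuntimeError("RAM vault is not mounted; refusing to save")

    before = archive_mtime_ns(archive)
    run(["tazpod", "save"], workspace)
    # tazpod save may print success without refreshing anything, so verify
    # that the archive changed (heuristic: modification time differs).
    if archive_mtime_ns(archive) == before:
        raise RuntimeError("tazpod save reported success but the archive is unchanged")

    # Always push from the workspace root because the command is cwd-sensitive.
    run(["tazpod", "push", "vault"], workspace)
```

Passing the runner as a parameter keeps the ordering and the two guards testable without a real tazpod binary.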

Decision Matrix

  TazPod  Host  S3  Default Action
  T0      H0    S0  first init allowed
  T0      H0    S1  hard fail; recover or rebuild the canonical TazPod anchor first
  T0      H0    S2  hard fail
  T1      H0    S1  restore allowed during create.sh
  T1      H0    S0  hard fail
  T1      H0    S2  hard fail
  T1      H1    S0  local lifecycle allowed; next backup initializes S3
  T1      H1    S1  normal local lifecycle
  T1      H1    S2  local lifecycle allowed; next backup repairs S3
  T2      *     *   hard fail
  *       H2    *   hard fail
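
The matrix above can be read as a lookup with two wildcard rows. The sketch below encodes exactly the listed rows; the function name default_action is hypothetical, and combinations the matrix does not list (for example T0 with H1) raise an error rather than guessing an action.

```python
# Explicit rows of the decision matrix, keyed by (TazPod, Host, S3).
EXPLICIT_ROWS = {
    ("T0", "H0", "S0"): "first init allowed",
    ("T0", "H0", "S1"): "hard fail; recover or rebuild the canonical TazPod anchor first",
    ("T0", "H0", "S2"): "hard fail",
    ("T1", "H0", "S1"): "restore allowed during create.sh",
    ("T1", "H0", "S0"): "hard fail",
    ("T1", "H0", "S2"): "hard fail",
    ("T1", "H1", "S0"): "local lifecycle allowed; next backup initializes S3",
    ("T1", "H1", "S1"): "normal local lifecycle",
    ("T1", "H1", "S2"): "local lifecycle allowed; next backup repairs S3",
}


def default_action(tazpod: str, host: str, s3: str) -> str:
    """Return the default action for one classified state (hypothetical helper)."""
    # Wildcard rows: any incoherent TazPod or host state is a hard fail.
    if tazpod == "T2" or host == "H2":
        return "hard fail"
    try:
        return EXPLICIT_ROWS[(tazpod, host, s3)]
    except KeyError:
        # The matrix does not define this combination; do not invent one.
        raise LookupError(f"{tazpod}+{host}+{s3} is not listed in the matrix") from None
```

For instance, default_action("T1", "H0", "S1") returns the restore row, while default_action("T0", "H0", "S1") returns the hard-fail row that asks for the TazPod anchor to be recovered first.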

Observed 2026-04-28 Destroy/Create Failure

Phase-1 validation for 09-vault-k8s-integration-prep hit the exact T0 + H0 + S1 branch.

  • destroy.sh removed the Hetzner server and local Terraform outputs
  • the new host came back empty as expected
  • the remote S3 lineage stayed coherent because destroy does not purge S3
  • the operator-side canonical bootstrap files were absent under ~/secrets/lushycorp-vault/

The runtime therefore failed correctly with the message that the TazPod bootstrap domain was empty while remote durability was coherent. The unresolved operational question is not the matrix itself; it is the approved recovery/reset path for that state.

Automated Restore Logic (C2)

If the local VPS is destroyed but TazPod and S3 are coherent, create.sh executes the following:

  • Metadata Sync: Reads latest.json pointer from S3 to find the correct snapshot.
  • Binary Restore: Executes vault operator raft snapshot restore using a temporary local instance.
  • Passphrase Hydration: Rehydrates /var/lib/lushycorp-vault/bootstrap/ unseal shares from TazPod.
  • Final Unseal: Reruns the systemd-managed unseal helper to bring the leader node online.

This restore path requires T1 + H0 + S1. It is not available for T0 + H0 + S1.

S3 Pointers and Rotation

  • Lineage ID: A unique UUID generated at first init (vault_lineage_id).
  • Global Path: vault/raft-snapshots/latest.json.
  • Lineage Path: vault/raft-snapshots/<lineage_id>/latest.json.
  • A/B Slots: Snapshot objects rotate between slot-a and slot-b to prevent corruption.
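
As a rough illustration of the layout above, the pointer keys and the slot alternation could be computed like this. The helper names are hypothetical, the use of a version-4 UUID is an assumption (the source only says "unique UUID"), and the sketch does not guess the object keys of the snapshots themselves or the schema of latest.json.

```python
import uuid
from typing import Optional

PREFIX = "vault/raft-snapshots"
GLOBAL_POINTER = f"{PREFIX}/latest.json"
SLOTS = ("slot-a", "slot-b")


def new_lineage_id() -> str:
    """Generate a lineage ID at first init (assumption: a random UUID4)."""
    return str(uuid.uuid4())


def lineage_pointer(lineage_id: str) -> str:
    """S3 key of the per-lineage pointer file."""
    return f"{PREFIX}/{lineage_id}/latest.json"


def next_slot(current: Optional[str]) -> str:
    """Pick the slot for the next snapshot, alternating between the two.

    With no previous snapshot (current is None) the rotation starts at slot-a.
    """
    if current is not None and current not in SLOTS:
        raise ValueError(f"unknown slot: {current!r}")
    return "slot-b" if current == "slot-a" else "slot-a"
```

One plausible reading of the A/B design is that each new snapshot goes to the slot the pointer does not currently reference, so an interrupted upload cannot damage the last known-good snapshot; the source states only that the rotation exists to prevent corruption.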

See Also