Ephemeral Castle Vault Bootstrap and Restore
The Vault runtime on Hetzner uses a strict classification logic to handle the transition from a fresh VPS to a fully hydrated, unsealed secret engine.
The Classification Matrix
The Ansible role vault-runtime (tasks in ansible/roles/vault-runtime/tasks/) determines the state of three domains:
- TazPod / operator-side canonical bootstrap set
  - empty (T0): none of the canonical bootstrap artifacts are present under ~/secrets/lushycorp-vault/
  - coherent (T1): the canonical set is complete and classifiable
  - incoherent (T2): files are partial, missing, or not one consistent lineage
- Local host/Vault state
  - empty (H0): no Raft data / no valid host-local Vault state
  - coherent (H1): local state is valid for the active lineage
  - incoherent (H2): partial or contradictory local state
- S3 remote durability
  - empty (S0): no usable lineage pointers or snapshots
  - coherent (S1): valid lineage pointers and snapshots exist
  - incoherent (S2): broken pointers, partial uploads, or mismatched metadata
Important Path Distinction
The runtime playbook classifies TazPod state from the decrypted canonical files in ~/secrets/lushycorp-vault/:
- init.json
- unseal-keys.json
- root-token.txt
- admin-token.txt
- admin-token.json
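As a minimal sketch of how the TazPod domain could be classified from these files, the function below looks only at which canonical files exist. The function name and the presence-only rule are assumptions for illustration; the real role may also verify that the files belong to one consistent lineage, which this sketch does not attempt.

```python
from pathlib import Path

# Canonical bootstrap files named in this section.
CANONICAL_FILES = (
    "init.json",
    "unseal-keys.json",
    "root-token.txt",
    "admin-token.txt",
    "admin-token.json",
)


def classify_tazpod(secrets_dir: Path) -> str:
    """Return "T0", "T1", or "T2" from file presence alone.

    Assumption: presence stands in for "complete and classifiable".
    A real check would also confirm one consistent lineage.
    """
    present = [name for name in CANONICAL_FILES if (secrets_dir / name).is_file()]
    if not present:
        return "T0"  # empty: no canonical artifact exists (or the directory is missing)
    if len(present) == len(CANONICAL_FILES):
        return "T1"  # coherent, under the presence-only assumption
    return "T2"  # incoherent: only part of the set is present
```

Called as `classify_tazpod(Path.home() / "secrets" / "lushycorp-vault")`, this returns T0 whenever the decrypted directory is empty or missing, regardless of whether an encrypted archive exists elsewhere.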
The encrypted operator archive /workspace/.tazpod/vault/vault.tar.aes is not the classification input by itself. On 2026-04-28 this distinction mattered operationally: the encrypted archive still existed, but the canonical files above were absent, so the playbook correctly classified TazPod as T0.
Follow-up verification on the same day closed the loop:
- the current encrypted archive was timestamped 2026-04-07, earlier than the C1/C2 Vault runtime builds
- a direct decrypted archive listing contained zero lushycorp-vault/* entries
So the current failure is not that unlock forgot to materialize files that were already inside the encrypted archive. The stronger conclusion is that the active encrypted archive itself never contained the canonical Vault bootstrap set.
Root-Cause Gap In The Runtime Flow
The current runtime logic already does the first half correctly:
- Phase B writes the canonical artifacts into ~/secrets/lushycorp-vault/ on the operator side.
What is missing is the second half:
- there is no follow-up tazpod save
- and no follow-up tazpod push vault
So the operator-side RAM view was updated during the original C1/C2 build, but the encrypted archive /workspace/.tazpod/vault/vault.tar.aes was never refreshed and pushed back to S3. That is why a later container/session reset can leave the runtime in T0 even though the earlier build had once produced a coherent canonical bootstrap set.
The runtime flow has now been corrected to run those persistence steps explicitly from /workspace, and to assert that the operator-side RAM vault is actually mounted before doing so.
Two operator-facing caveats discovered during remediation
- tazpod save currently prints a success-style line even when the RAM vault is not mounted and no archive was refreshed.
- tazpod push vault is cwd-sensitive: it must be called from the canonical workspace root, not from an arbitrary directory.
So the durable operational rule remains: successful Vault materialization in ~/secrets/lushycorp-vault/ is not enough by itself; the encrypted archive must be refreshed and pushed from the canonical workspace root.
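The persistence steps and both caveats can be sketched as a small wrapper. This is an illustration under stated assumptions, not the actual playbook task: the mount point (~/secrets), the function name, and the use of the archive's modification time as evidence of a refresh are all hypothetical. Only the two tazpod commands, the /workspace root, and the archive path come from this page.

```python
import os
import subprocess
from pathlib import Path


def persist_vault_archive(
    workspace: Path = Path("/workspace"),
    mount_point: Path = Path.home() / "secrets",  # assumed RAM vault mount point
    run=subprocess.run,
    is_mounted=os.path.ismount,
) -> None:
    """Refresh the encrypted archive and push it, failing loudly on any doubt."""
    archive = workspace / ".tazpod" / "vault" / "vault.tar.aes"

    # Guard 1: the RAM vault must actually be mounted before saving.
    if not is_mounted(mount_point):
        raise RuntimeError(f"RAM vault is not mounted at {mount_point}")

    before = archive.stat().st_mtime_ns if archive.exists() else None

    # Run from the workspace root because the push step is cwd-sensitive.
    run(["tazpod", "save"], cwd=workspace, check=True)

    # Guard 2 (assumption): do not trust the success-style output of save;
    # require that the archive file exists and its mtime has changed.
    after = archive.stat().st_mtime_ns if archive.exists() else None
    if after is None or after == before:
        raise RuntimeError("tazpod save did not refresh the encrypted archive")

    run(["tazpod", "push", "vault"], cwd=workspace, check=True)
```

The `run` and `is_mounted` parameters exist only so the guards can be exercised without a real tazpod installation.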
Decision Matrix
| TazPod | Host | S3 | Default Action |
|---|---|---|---|
| T0 | H0 | S0 | first init allowed |
| T0 | H0 | S1 | hard fail; recover or rebuild the canonical TazPod anchor first |
| T0 | H0 | S2 | hard fail |
| T1 | H0 | S1 | restore allowed during create.sh |
| T1 | H0 | S0 | hard fail |
| T1 | H0 | S2 | hard fail |
| T1 | H1 | S0 | local lifecycle allowed; next backup initializes S3 |
| T1 | H1 | S1 | normal local lifecycle |
| T1 | H1 | S2 | local lifecycle allowed; next backup repairs S3 |
| T2 | * | * | hard fail |
| * | H2 | * | hard fail |
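The matrix can be read as a plain lookup. The sketch below encodes the explicit rows as a dictionary and handles the two wildcard rows first; the function name is hypothetical. Combinations the table does not list (for example T0 with H1) return None rather than a guessed action.

```python
from typing import Optional

# Explicit rows of the decision matrix, keyed by (TazPod, Host, S3).
_ROWS = {
    ("T0", "H0", "S0"): "first init allowed",
    ("T0", "H0", "S1"): "hard fail; recover or rebuild the canonical TazPod anchor first",
    ("T0", "H0", "S2"): "hard fail",
    ("T1", "H0", "S1"): "restore allowed during create.sh",
    ("T1", "H0", "S0"): "hard fail",
    ("T1", "H0", "S2"): "hard fail",
    ("T1", "H1", "S0"): "local lifecycle allowed; next backup initializes S3",
    ("T1", "H1", "S1"): "normal local lifecycle",
    ("T1", "H1", "S2"): "local lifecycle allowed; next backup repairs S3",
}


def default_action(tazpod: str, host: str, s3: str) -> Optional[str]:
    """Return the default action for a state triple, or None if unlisted."""
    # Wildcard rows: any incoherent TazPod or host state fails hard.
    if tazpod == "T2" or host == "H2":
        return "hard fail"
    # None means the combination is not covered by the table.
    return _ROWS.get((tazpod, host, s3))
```

For the 2026-04-28 case, `default_action("T0", "H0", "S1")` returns the hard-fail row that asks for the TazPod anchor to be recovered or rebuilt first.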
Observed 2026-04-28 Destroy/Create Failure
Phase-1 validation for 09-vault-k8s-integration-prep hit the exact T0 + H0 + S1 branch.
- destroy.sh removed the Hetzner server and local Terraform outputs
- the new host came back empty as expected
- the remote S3 lineage stayed coherent because destroy does not purge S3
- the operator-side canonical bootstrap files were absent under ~/secrets/lushycorp-vault/
The runtime therefore failed correctly with the message that the TazPod bootstrap domain was empty while remote durability was coherent. The unresolved operational question is not the matrix itself; it is the approved recovery/reset path for that state.
Automated Restore Logic (C2)
If the local VPS is destroyed but TazPod and S3 are both coherent, create.sh executes the following:
- Metadata Sync: Reads the latest.json pointer from S3 to find the correct snapshot.
- Binary Restore: Executes vault operator raft snapshot restore using a temporary local instance.
- Passphrase Hydration: Rehydrates /var/lib/lushycorp-vault/bootstrap/ unseal shares from TazPod.
- Final Unseal: Reruns the systemd-managed unseal helper to bring the leader node online.
This restore path requires T1 + H0 + S1. It is not available for T0 + H0 + S1.
S3 Pointers and Rotation
- Lineage ID: A unique UUID generated at first init (vault_lineage_id).
- Global Path: vault/raft-snapshots/latest.json.
- Lineage Path: vault/raft-snapshots/<lineage_id>/latest.json.
- A/B Slots: Snapshot objects rotate between slot-a and slot-b to prevent corruption.
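The pointer layout and slot rotation above can be sketched as a few helpers. The function names are hypothetical, and the choice of a random (version 4) UUID is an assumption, since this page only says the lineage ID is a unique UUID. The rotation simply alternates slots, so a new snapshot never overwrites the slot that was written most recently.

```python
import uuid
from typing import Optional

SNAPSHOT_PREFIX = "vault/raft-snapshots"
GLOBAL_POINTER_KEY = f"{SNAPSHOT_PREFIX}/latest.json"
SLOTS = ("slot-a", "slot-b")


def new_lineage_id() -> str:
    # Assumption: a random version 4 UUID generated once, at first init.
    return str(uuid.uuid4())


def lineage_pointer_key(lineage_id: str) -> str:
    """Return the per-lineage pointer key under the snapshot prefix."""
    return f"{SNAPSHOT_PREFIX}/{lineage_id}/latest.json"


def next_slot(current: Optional[str]) -> str:
    """Return the slot the next snapshot should be written to."""
    if current is None:
        return SLOTS[0]  # no snapshot yet: start with slot-a
    if current not in SLOTS:
        raise ValueError(f"unknown slot: {current!r}")
    # Alternate so the most recent snapshot is left untouched.
    return SLOTS[1] if current == SLOTS[0] else SLOTS[0]
```

Under this scheme a failed or partial upload can only damage the inactive slot. Updating latest.json only after the upload succeeds would keep the pointer on a complete snapshot, though that ordering is an assumption here rather than something this page states.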
See Also
- Architecture: Vault Architecture
- Secret Flow: TazLab Secret Flow
- Hub: Ephemeral Castle Hub