Hermes Agent: LXC Deployment Architecture

Level 2 (Topic)

Concept

Hermes Agent runs as a bare-metal installation inside a hardened unprivileged Proxmox LXC container. The deployment follows the same enterprise pattern as the Hetzner Vault runtime: Terraform creates the LXC infrastructure, Ansible configures the internal state, and a top-level shell orchestrator (create.sh/destroy.sh) ties them together with structured logging and timing instrumentation.

The critical architectural challenge was achieving PVC-like (Kubernetes PersistentVolumeClaim style) data persistence on LVM-thin storage. Proxmox 9.1 does not support this natively for LXC containers: every volume matching vm-<vmid>-disk-* is destroyed together with container <vmid>. The solution is a Pet vs Cattle pattern: a dedicated “pet” container (CT 999, protection=1) owns the persistent volume, and the ephemeral “cattle” container (CT 105) mounts it.

Architecture / Design

Container Model

  • Cattle — CT 105 (Hermes Agent): Unprivileged Debian 12 LXC, managed by Terraform, destroyed/recreated on each cycle
  • Pet — CT 999 (pet-storage): Minimal container (1 core, 256MB), protection=1 — never destroyed. Owns persistent volumes
  • Hermes runs as hermes user (UID 10000, non-root, no sudo) inside CT 105
  • features: nesting=true for compatibility with Hermes’ internal subprocess management

Storage Architecture

  • Cattle rootfs: 20G on local-lvm (ephemeral — destroyed with CT 105)
  • Pet rootfs: 2G on local-lvm (permanent — tied to CT 999)
  • Persistent volume: 10G LVM-thin volume local-lvm:vm-999-disk-1, owned by CT 999, mounted into CT 105 at /home/hermes via API
  • No backup/restore needed: Volume survives destroy/create because ownership is tied to CT 999, not CT 105
  • Reusable: CT 999 can own multiple volumes for different cattle containers
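
The ownership rule behind this layout comes down to volume naming: Proxmox ties a volume to the container whose ID appears in vm-<vmid>-disk-*. The following is a minimal sketch of that rule, not part of the actual tooling; the helper names volume_owner and destroyed_with are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the vm-<vmid>-disk-<n> naming rule.

# Print the VMID embedded in a volume ID such as "local-lvm:vm-999-disk-1".
# Returns 1 if the name does not follow the vm-<vmid>-disk-<n> pattern.
volume_owner() {
  local name vmid
  name="${1#*:}"              # drop the "storage:" prefix
  case "$name" in
    vm-*-disk-*) ;;
    *) return 1 ;;
  esac
  vmid="${name#vm-}"          # "999-disk-1"
  vmid="${vmid%%-disk-*}"     # "999"
  case "$vmid" in
    ''|*[!0-9]*) return 1 ;;  # reject empty or non-numeric IDs
  esac
  printf '%s\n' "$vmid"
}

# Succeed when destroying container $2 would also remove volume $1.
destroyed_with() {
  local owner
  owner="$(volume_owner "$1")" || return 1
  [ "$owner" = "$2" ]
}
```

With these helpers, destroyed_with local-lvm:vm-999-disk-1 105 fails while destroyed_with local-lvm:vm-999-disk-1 999 succeeds, which is why the volume outlives CT 105.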

Lifecycle

create.sh:  Pet Ensure (2s) → Terraform Create (11s) → Wait SSH (15s) →
            Attach Volume (8s) → Ansible Baseline (56s) → Agent (30s) →
            Configure (6s) → Verify (9s)  →  TOTAL: 137s (2min 17s)

destroy.sh: Stop CT 105 → API detach mp0 → Terraform Destroy → total ~15s
            (volume vm-999-disk-1 preserved on CT 999)
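
The per-phase timings above suggest a small wrapper around each step. The following is a sketch of how such instrumentation might look, not the actual create.sh; run_phase, format_total and the log format are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical phase runner: times one step and keeps a running total.
TOTAL_SECONDS=0

run_phase() {  # run_phase NAME command [args...]
  local name start elapsed rc
  name="$1"; shift
  start=$(date +%s)
  if "$@"; then
    elapsed=$(( $(date +%s) - start ))
    TOTAL_SECONDS=$(( TOTAL_SECONDS + elapsed ))
    echo "phase=\"$name\" status=ok seconds=$elapsed"
  else
    rc=$?
    echo "phase=\"$name\" status=failed rc=$rc" >&2
    return "$rc"
  fi
}

# Render a total in the same style as the summary, e.g. "137s (2min 17s)".
format_total() {
  printf 'TOTAL: %ds (%dmin %ds)\n' "$1" "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}
```

An orchestrator could then chain calls such as run_phase "Pet Ensure" pet_ensure (where pet_ensure stands for whatever command implements that step) and finish with format_total "$TOTAL_SECONDS". Stopping at the first failing phase keeps a half-built container from being configured.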

Key Design Decisions

  • Bare-metal install, not Docker: Docker inside an unprivileged LXC falls back from the overlay2/ZFS storage drivers to the vfs driver, and rootless Docker breaks network_mode: host. A direct install.sh run avoids both problems.
  • No Bubblewrap sandbox: Bubblewrap requires CAP_SYS_ADMIN (dropped by the hardening), and the Proxmox AppArmor profile blocks mount(). LXC isolation alone was deemed sufficient.
  • Pet vs Cattle, not bind-mount: Proxmox API rejects bind-mounts for token-based auth (HTTP 403). The pet container pattern provides true persistence with only API token access.
  • Volume attached via API on stopped container: --data-urlencode "mp0=local-lvm:vm-999-disk-1,mp=/home/hermes" on PUT /config, then start container. Hotplug (mount on running container) fails — container must be stopped first.
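
The stop, attach, start ordering (and the matching stop, detach ordering used on destroy) can be sketched as below. This is an illustration under stated assumptions, not the project's script: PVE_TOKEN, DRY_RUN and all function names are hypothetical, and the status check is a simple text match on the API's JSON response.

```shell
#!/usr/bin/env bash
# Sketch of the stop -> attach -> start and stop -> detach sequences.
# Assumed variables: PVE_TOKEN ("user@realm!tokenid=secret"), DRY_RUN.
PVE_API="${PVE_API:-https://proxmox:8006/api2/json}"
PVE_NODE="${PVE_NODE:-tazlab}"

pve() {  # pve METHOD PATH [extra curl args...]
  local method path
  method="$1"; path="$2"; shift 2
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$method $path $*"   # print the request instead of sending it
    return 0
  fi
  curl -sk --fail -X "$method" "$PVE_API$path" \
    -H "Authorization: PVEAPIToken=$PVE_TOKEN" "$@"
}

wait_stopped() {  # poll until the container reports "stopped" (max ~30s)
  local vmid i
  vmid="$1"; i=0
  [ "${DRY_RUN:-0}" = "1" ] && return 0
  while [ "$i" -lt 30 ]; do
    # Heuristic: relies on the API returning compact JSON.
    pve GET "/nodes/$PVE_NODE/lxc/$vmid/status/current" \
      | grep -q '"status":"stopped"' && return 0
    i=$((i + 1)); sleep 1
  done
  return 1
}

attach_volume() {  # attach_volume VMID VOLUME MOUNTPOINT
  pve POST "/nodes/$PVE_NODE/lxc/$1/status/stop" || return 1
  wait_stopped "$1" || return 1
  pve PUT "/nodes/$PVE_NODE/lxc/$1/config" \
    --data-urlencode "mp0=$2,mp=$3" || return 1
  pve POST "/nodes/$PVE_NODE/lxc/$1/status/start"
}

detach_volume() {  # detach_volume VMID  (the volume itself is preserved)
  pve POST "/nodes/$PVE_NODE/lxc/$1/status/stop" || return 1
  wait_stopped "$1" || return 1
  pve PUT "/nodes/$PVE_NODE/lxc/$1/config" -d "delete=mp0"
}
```

Running DRY_RUN=1 attach_volume 105 local-lvm:vm-999-disk-1 /home/hermes prints the three requests in order without contacting the API. The stop request is asynchronous, so the sketch polls for the stopped state before changing the config.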

Pet Container (CT 999) Details

  • Managed by separate terraform-pet/main.tf (independent state)
  • protection = true prevents any accidental deletion via API or GUI
  • 1 core, 256MB RAM, 2G rootfs
  • Single purpose: own and serve the persistent 10G volume vm-999-disk-1
  • Must be created ONCE (terraform apply), then never modified
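
Because the pet must be created exactly once, an idempotent guard is a natural fit for the "Pet Ensure" step. The sketch below shows one way such a guard might work; it is not the actual implementation. PVE_TOKEN and the PET_CHECK and TERRAFORM override hooks are assumptions added for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical "Pet Ensure" step: create CT 999 only if it is missing.
# PET_CHECK and TERRAFORM are assumed override hooks for dry runs and tests.
PVE_API="${PVE_API:-https://proxmox:8006/api2/json}"
PVE_NODE="${PVE_NODE:-tazlab}"

pet_exists() {  # succeeds if the API can read the container's status
  # Caveat: any API failure (including an auth error) counts as "missing".
  curl -sk --fail -o /dev/null \
    -H "Authorization: PVEAPIToken=$PVE_TOKEN" \
    "$PVE_API/nodes/$PVE_NODE/lxc/$1/status/current"
}

pet_ensure() {  # pet_ensure [VMID]
  local vmid
  vmid="${1:-999}"
  if "${PET_CHECK:-pet_exists}" "$vmid"; then
    echo "pet CT $vmid present, nothing to do"
  else
    echo "pet CT $vmid missing, creating"
    "${TERRAFORM:-terraform}" -chdir=terraform-pet apply -auto-approve
  fi
}
```

Keeping the check separate from the apply means the common path never touches the pet's Terraform state, which matches the "never modified" rule. A production version would want to distinguish "container not found" from other API errors before applying.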

Research History

Three research sessions confirmed the Proxmox 9.1 LVM-thin behavior:

  1. API mount syntax: --data-urlencode required for values with commas. size= causes errors on existing volumes. delete=mp0 works for detach.
  2. LVM ownership: Proxmox always scans storage for vm-<vmid>-disk-* during container destroy. delete_unreferenced_disks_on_destroy flag doesn’t exist for LXC in bpg/proxmox v0.106.
  3. Proxmox 9.1 specifics: destroy-unreferenced-disks=0 ignored for LXC. Bind-mounts blocked (403 for tokens). No API rename endpoint. ZFS not available on this host.

Full research assets are in AGENTS.ctx/crisp-build/assets/.

Iteration History

Iteration | Feature                                                              | Status
1         | LXC base + networking + SSH                                          | Done
2         | Hermes bare-metal install (Playwright dep pre-install, --skip-setup) | Done
3         | Configuration (opencode-go backend, systemd units, dashboard :9119)  | Done
4a        | Managed volume + backup/restore                                      | Deprecated
4b        | Pet vs Cattle persistence (current)                                  | Done
5         | Bubblewrap sandbox                                                   | Cancelled

Known Issues / Technical Debt

  • Dashboard Web UI npm build may fail on first start — run cd web && npm install && npm run build manually inside the container
  • CT 999 has protection=1 — must disable protection before any maintenance operation
  • Volume attach requires container to be stopped (hotplug not supported)
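
For the protection flag, one option is to wrap maintenance in a helper that lifts protection and restores it afterwards. This is a sketch under assumptions: it uses pct set with the --protection option on the Proxmox host (as root), and the function name and the PCT override hook are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: lift protection, run one command, restore protection.
# PCT is an assumed override hook; by default the real pct CLI is used,
# which must run as root on the Proxmox host.
with_protection_off() {  # with_protection_off VMID command [args...]
  local vmid rc
  vmid="$1"; shift
  "${PCT:-pct}" set "$vmid" --protection 0 || return 1
  if "$@"; then rc=0; else rc=$?; fi
  # Restore protection even when the maintenance command failed.
  "${PCT:-pct}" set "$vmid" --protection 1 || return 1
  return "$rc"
}
```

The wrapper returns the maintenance command's exit status, so a failed operation still surfaces to the caller while CT 999 ends up protected again.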

Reference Blocks

Create API call (attach existing pet volume)

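# PVE_TOKEN is a placeholder for "user@realm!tokenid=secret"; CT 105 must be stopped.
curl -sk -X PUT "https://proxmox:8006/api2/json/nodes/tazlab/lxc/105/config" \
  -H "Authorization: PVEAPIToken=${PVE_TOKEN}" \
  --data-urlencode "mp0=local-lvm:vm-999-disk-1,mp=/home/hermes"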

Destroy API call (detach, preserve volume)

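# PVE_TOKEN is a placeholder for "user@realm!tokenid=secret"; removes only the mount entry.
curl -sk -X PUT "https://proxmox:8006/api2/json/nodes/tazlab/lxc/105/config" \
  -H "Authorization: PVEAPIToken=${PVE_TOKEN}" \
  -d "delete=mp0"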

ansible.cfg SSH keepalive

[ssh_connection]
ssh_args = -o ServerAliveInterval=30 -o ServerAliveCountMax=3 -o ConnectTimeout=10

See Also