Architecture
How the pieces fit together at runtime. Control plane, data plane, storage, agent internals, and the two-server HA topology. Everything here is derived directly from the running code, not aspirational.
Top-down view
```
        login.quickztna.com
      (Caddy auto-TLS, HTTP/3)
                │
┌───────────────┼───────────────┐
▼               ▼               ▼
React Vite SPA  Hono/Node API   WebSocket RT
(static CDN)    (Docker, :3000) (same process)
                │
┌───────────────┼───────────────┐
▼               ▼               ▼
PostgreSQL 16   Valkey (Redis)  Cloudflare R2
(primary + HA   (cache, pub/sub, (object storage,
 replica)        rate limits)    backups, recordings)
```

Data plane (peer-to-peer, server-assisted only):

- STUN (UDP 3478) → NAT discovery
- DERP (WSS 443) → relay fallback (4 regions)
- WireGuard → direct encrypted tunnels, PQC hybrid

Control plane
Frontend — React SPA
Vite + React 18 + TypeScript + shadcn/ui + Tailwind CSS. 70+ pages, code-split per route.
- API client — Supabase-style chainable query builder (`api.from("machines").eq(...)`)
- Auth context — access token in memory, refresh via `__Host-` cookie
- Org context — selected org loaded on auth; every query scoped to `org_id`
- Realtime — single WebSocket with per-channel fan-out
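The chainable builder can be sketched in a few lines. This is a hypothetical reconstruction: `from` and `eq` appear above, but the internals and the URL encoding are assumptions.

```typescript
// Minimal sketch of a Supabase-style chainable query builder.
// Only .from() and .eq() are named in the docs; the rest is assumed.
class QueryBuilder {
  private filters: Array<{ column: string; value: string }> = [];
  constructor(private table: string) {}

  eq(column: string, value: string): this {
    this.filters.push({ column, value });
    return this;
  }

  // Render the request path this query would hit.
  toPath(): string {
    const qs = this.filters
      .map((f) => `${f.column}=eq.${encodeURIComponent(f.value)}`)
      .join("&");
    return `/api/${this.table}${qs ? `?${qs}` : ""}`;
  }
}

const api = { from: (table: string) => new QueryBuilder(table) };

const path = api.from("machines").eq("org_id", "org_123").toPath();
// path === "/api/machines?org_id=eq.org_123"
```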
Build output is static files served by Caddy from /srv/frontend.
Backend — Hono / Node.js 20
One process, many handlers. 75 handler files in backend/src/handlers/, each exports handleX(request, env): Response.
- Router —
backend/src/router.ts, 100+ routes with middleware ordering (logging → CORS → CSRF → auth → handler) - Envelope — every response shaped as
{ success, data, error } - Database —
pgifylayer converts?placeholders to$1/$2. Query patterns:.first(),.all(),.run(),.batch() - Migrations —
backend/migrations/*.sql, auto-applied on startup (48 migrations to date)
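The `?` → `$n` rewrite can be illustrated with a naive sketch; the real layer's edge-case handling (e.g. `?` inside string literals) is not shown here and is ignored below.

```typescript
// Naive sketch of the pgify placeholder rewrite: SQLite-style `?`
// becomes PostgreSQL's numbered `$1`, `$2`, ... Ignores `?` inside
// string literals, which a real implementation must handle.
function pgify(sql: string): string {
  let n = 0;
  return sql.replace(/\?/g, () => `$${++n}`);
}

const out = pgify("SELECT * FROM machines WHERE org_id = ? AND id = ?");
// out === "SELECT * FROM machines WHERE org_id = $1 AND id = $2"
```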
State stores
- PostgreSQL 16 — source of truth. Multi-tenant via `org_id` foreign keys. Streaming replication to a standby.
- Valkey (Redis-compatible) — session KV (`refresh:<token>`, `jwt_blacklist:<user>`), rate limits, WebSocket pub/sub for cross-server broadcast.
- Cloudflare R2 — object storage for session recordings, audit exports, PG backups.
- PgBouncer — connection pool at `:6432` in front of PG `:5432`.
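As a sketch, the Valkey key patterns above can be written as small helpers. The two prefixes come from the list; the fixed-window rate-limit scheme is an assumption for illustration.

```typescript
// Helpers for the Valkey key patterns named above. The refresh and
// blacklist prefixes are from the docs; the rate-limit key is assumed.
const refreshKey = (token: string) => `refresh:${token}`;
const jwtBlacklistKey = (user: string) => `jwt_blacklist:${user}`;

// Hypothetical fixed-window rate limit: one counter per subject per minute.
const rateLimitKey = (subject: string, nowMs: number) =>
  `ratelimit:${subject}:${Math.floor(nowMs / 60_000)}`;

refreshKey("abc");                   // "refresh:abc"
rateLimitKey("ip:1.2.3.4", 120_000); // "ratelimit:ip:1.2.3.4:2"
```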
Data plane
WireGuard mesh
Every peer has a static WireGuard key pair. Tunnels are peer-to-peer direct
whenever NAT allows. The PQC hybrid key exchange (X25519 + ML-KEM-768,
combined via HKDF-SHA256) produces the WireGuard PSK. Routing is kernel on
Linux, userspace (wireguard-go) everywhere else.
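The hybrid derivation above can be sketched with Node's built-in HKDF. The concatenation order, empty salt, and `"wg-psk"` info label are assumptions, not the product's actual construction; the stand-in buffers below take the place of real X25519 and ML-KEM-768 outputs.

```typescript
import { hkdfSync } from "node:crypto";

// Sketch: concatenate the X25519 and ML-KEM-768 shared secrets, run
// HKDF-SHA256, and take 32 bytes as the WireGuard PSK. Salt, info label,
// and ordering are illustrative assumptions.
function derivePsk(x25519Secret: Buffer, mlkemSecret: Buffer): Buffer {
  const ikm = Buffer.concat([x25519Secret, mlkemSecret]);
  return Buffer.from(hkdfSync("sha256", ikm, Buffer.alloc(0), "wg-psk", 32));
}

// Stand-ins for the real shared secrets (32 bytes each here).
const psk = derivePsk(Buffer.alloc(32, 1), Buffer.alloc(32, 2));
// psk.length === 32
```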
DERP relays
4 global WebSocket-over-TLS regions for peers that can't reach each other directly:

- India (Bangalore)
- US East (New York)
- Europe (London)
- US West (San Francisco)
DERP is TLS 1.3 minimum with the X25519Kyber768 hybrid KEM enabled at the TLS layer. Agents pick the lowest-latency relay via `ztna netcheck`.
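The selection step reduces to "probe each region, keep the lowest round-trip time." A minimal sketch (the real `ztna netcheck` logic is not shown in this doc; region codes and RTTs below are made up):

```typescript
interface Probe {
  region: string;
  rttMs: number;
}

// Pick the relay with the lowest measured round-trip time.
function pickRelay(probes: Probe[]): Probe {
  return probes.reduce((best, p) => (p.rttMs < best.rttMs ? p : best));
}

const best = pickRelay([
  { region: "blr", rttMs: 38 },  // India (Bangalore)
  { region: "nyc", rttMs: 212 }, // US East (New York)
  { region: "lon", rttMs: 145 }, // Europe (London)
]);
// best.region === "blr"
```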
STUN
UDP 3478. Used by agents to discover their public endpoint before attempting P2P. STUN servers are embedded in the DERP host fleet.
Go agent
Four binaries built from the same tree:

- CLI binary (`ztna`), 39 commands.
- Windows service with workforce monitoring (build tag `workforce`).
- Wails v2 GUI (Windows systray).
- Alternative GUI binary.
Go 1.24+ is required — we use the stdlib `crypto/mlkem`. Three variants (`client/`, `client-core/`, `client-enterprise/`) share identical PQC code.
Install flow
1. `install.sh` / `install.ps1` detects OS + arch
2. Query `/api/client-version` with `action: "check"`, `platform: "linux-amd64"` → get release URL
3. Download binary from `/api/releases/*`
4. Verify SHA-256 checksum
5. Install as a systemd / launchd / Windows service
6. If `ZTNA_AUTH_KEY` is set: run `ztna login --auth-key` + `ztna up`
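Step 4 (checksum verification) can be sketched with Node's built-in hashing; the hex-digest format is an assumption about how the checksum is published.

```typescript
import { createHash } from "node:crypto";

// Verify a downloaded binary against its published SHA-256 hex digest.
function verifySha256(binary: Buffer, expectedHex: string): boolean {
  const actual = createHash("sha256").update(binary).digest("hex");
  return actual === expectedHex.toLowerCase();
}

// Well-known digest of the ASCII string "hello", used as a stand-in binary.
verifySha256(
  Buffer.from("hello"),
  "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
);
// → true
```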
The `client_versions` DB table controls which binary is "latest"; set `is_latest = true` on a row to publish it.
Authentication flows
Browser login (SPA)
1. `POST /api/auth/login` with email + password
2. Response: access_token + refresh_token (body) + `Set-Cookie: __Host-refresh_token`
3. SPA stores the access_token in memory only
4. On 401: `POST /api/auth/refresh` (cookie auto-sent) → new access token
5. On logout: `POST /api/auth/logout` clears the server-side refresh token + cookie
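The retry-on-401 behavior in step 4 can be sketched as a wrapper: attempt the request, refresh once, retry once. Only the endpoint paths come from the steps above; the wrapper and its `Fetch` type are illustrative assumptions.

```typescript
type Fetch = (path: string) => Promise<{ status: number }>;

// On a 401, hit /api/auth/refresh (the __Host-refresh_token cookie rides
// along automatically in a browser), then retry the original request once.
async function fetchWithRefresh(
  fetchFn: Fetch,
  path: string
): Promise<{ status: number }> {
  const res = await fetchFn(path);
  if (res.status !== 401) return res;
  await fetchFn("/api/auth/refresh");
  return fetchFn(path); // retry with the refreshed access token
}
```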
Agent registration
1. `POST /api/register-machine` with the auth key in the Authorization header
2. Server validates the auth key, generates a `node_key`, allocates a tailnet IP
3. Agent stores the `node_key` locally (`~/.config/ztna/`) and uses it for all future calls
4. Agent heartbeats every 60s via `POST /api/machine-heartbeat`
CLI login (browser-backed)
1. `ztna login` starts a local HTTP server on a random port → opens the browser
2. User logs in via SSO / password
3. SPA calls `/api/auth` with `action: cli_complete` → one-time code
4. SPA redirects to `http://127.0.0.1:PORT?code=...`
5. Local server exchanges the code for a JWT pair + saves it to disk
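Steps 4–5 hinge on the local server pulling the one-time code off the redirect URL. The real agent is Go; this TypeScript sketch only illustrates the parsing, and the port and code value below are made up (the `code` param name is from the steps above).

```typescript
// Extract the one-time code from the redirect URL, or null if absent.
function extractCode(redirectUrl: string): string | null {
  return new URL(redirectUrl).searchParams.get("code");
}

const code = extractCode("http://127.0.0.1:53127/?code=otc_abc123");
// code === "otc_abc123"
```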
Two-server HA topology
Docker containers on PROD (10 total):

- Production API (`:3000`, host network)
- Reverse proxy + TLS (`:80`/`:443`)
- PostgreSQL PRIMARY (`:5432`)
- Connection pooler (`:6432`)
- Valkey PRIMARY (`:6379`)
- Prometheus, Grafana, Loki, Alertmanager, Blackbox exporter (localhost only)
Firewall (UFW): only `:80` and `:443` accept external traffic. PostgreSQL and the API are not directly reachable from outside.
Backups: `pg_dump` daily at 03:00 UTC → local retention 7 days → upload to R2 (`s3://quickztna/backups/postgres/`). Prometheus alerts if a backup is stale >25h or fails.
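The >25h staleness rule restated as code, purely for illustration (the actual alert lives in Prometheus, not application code):

```typescript
// A backup is stale once it is strictly older than 25 hours.
const STALE_AFTER_MS = 25 * 60 * 60 * 1000;

function backupIsStale(lastBackupAt: Date, now: Date): boolean {
  return now.getTime() - lastBackupAt.getTime() > STALE_AFTER_MS;
}
```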
Repository layout
```
src/                      # React frontend (Vite + shadcn/ui)
backend/                  # Node.js API (Hono + PostgreSQL + Valkey)
ztna/client/              # Go CLI (Go 1.24+, native crypto/mlkem)
ztna/client-core/         # Go client — core variant
ztna/client-enterprise/   # Go client — enterprise variant
ztna/server/              # DERP + STUN Docker stack
public/install.sh         # Linux/macOS installer
public/install.ps1        # Windows installer
docker-compose.yml
Caddyfile
vite.config.ts
```

See also
- Security — the crypto side of the data plane in detail
- API Reference — every endpoint the control plane exposes
- CLI Reference — every command the agent supports