
Troubleshooting

A cross-cutting, symptom-first guide to the things that actually go wrong on a running deployment. The individual runbooks each carry an "If something goes wrong" section for their own step; this page collects the failures that span steps — where the cause is in one component and the symptom shows up in another — so you can start from what you're seeing and work back to the fix.

For each symptom: the likely cause, how to check, and the fix (with a link to the runbook or reference that owns the detail).


A client can't connect

The user opens the app or the web client and never reaches the map — a TLS warning, a login error, or a silent hang.

Likely cause | How to check | Fix
Certificate not trusted — the client doesn't have the Root CA, or the server presents only its leaf, not the chain. | Connecting shows an "untrusted" / certificate warning before any login screen. | Distribute the Root CA to clients and make sure the server presents leaf + intermediate (not just the leaf). See Set up the certificate chain → If something goes wrong.
Hostname mismatch — the server's certificate SAN doesn't list the name the client dialed. | A hostname-mismatch TLS error naming the host. | Reissue the server leaf with the right hostname in the SAN. See Set up the certificate chain.
DNS hasn't propagated / Caddy hasn't issued TLS — the site or the Directory is unreachable. | dig <hostname> doesn't resolve to the static IP from tofu output. | Add the A record and wait for propagation; Caddy can't get a Let's Encrypt cert until DNS resolves. See Stand up the Directory and Stand up the web client.
OIDC login fails (web) — wrong issuer URL, or the OIDC client secret / web service token wasn't uploaded. | Login redirect loops, or fails closed; the bedrock-<env>-webauth-client-secret secret is missing, or the revocation cache can't prime. | Check WAYPOINT_OIDC_ISSUER_URL / directory_url points at the Directory, and that both the OIDC client secret and the web service token are uploaded. See Stand up the web client → If something goes wrong.
User isn't onboarded — "user not found" / access denied. | Login reaches the Directory but is rejected for that person. | Onboard them first. See Onboard an operator.
The user was revoked — their tokens are now refused everywhere. | Their Directory page shows a red revoked badge. | Revocation is permanent. If it was a mistake, onboard them again from scratch. See Revoke a principal and A device or operator can't be revoked below.
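
For the hostname-mismatch row, it can help to compare the name the client dialed against the DNS names in the server leaf's SAN before reissuing anything. The sketch below is a simplified matcher written for illustration; it is not the client's actual verification code. The function name and the example hostnames are hypothetical, and the wildcard rule (a leading "*." covers exactly one left-most label) is the common convention, assumed here rather than taken from this deployment.

```python
from typing import Iterable


def san_matches(hostname: str, san_dns_names: Iterable[str]) -> bool:
    """Return True if `hostname` is covered by any DNS SAN entry.

    Simplified illustration: exact match, or a leading "*." wildcard
    that stands in for exactly one left-most label.
    """
    host = hostname.lower().rstrip(".")
    host_labels = host.split(".")
    for entry in san_dns_names:
        pattern = entry.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            pattern_labels = pattern.split(".")
            # Same number of labels, and everything after the first matches.
            if (len(pattern_labels) == len(host_labels)
                    and pattern_labels[1:] == host_labels[1:]):
                return True
    return False


# Hypothetical names, for illustration only.
print(san_matches("relay.example.org", ["relay.example.org"]))
print(san_matches("relay.example.org", ["directory.example.org"]))
```

If the dialed name is not covered, the fix is still the one in the table: reissue the leaf with that name in the SAN.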

The authoritative gate on the client→router hop is server-cert TLS plus app-layer envelope verification — the client presents no client cert. If TLS is clean but messages still don't flow, the problem lies beyond the connection itself: see the next two sections. (See Security model → Transport TLS.)


Messages aren't arriving

The client connects and the map loads, but positions, chat, or drawings from other users never show up — or only some do.

Likely cause | How to check | Fix
No server covers the user's cell — routing is cell-first; a router only relays traffic for the geohash-5 cells in its ServerToken.coverage_cells, and denies the rest by ACL. | Affected users are operating in an area whose cell isn't in any server's coverage list. The web map is empty / no other users appear. | Add the missing cell(s) to the server's coverage and reissue its ServerToken, then restart the box. See Add a server → How geofencing works and the wire protocol key namespace.
The web client has no relay configured — zenoh_router_endpoints is empty, so the browser fails closed with no map data. | The web zenoh_router_endpoints Terraform variable is blank or points at the wrong WSS address. | Set it to a live server's WSS endpoint (e.g. wss://…:7447) and re-apply. See Stand up the web client → Server / relay.
Classification floor is dropping the message — the publish-side gate denies any payload above min(sender ceiling, server ceiling). | Traffic at a higher classification than the server's ceiling silently doesn't relay; the server audit shows a classification denial. | The relay's Classification ceiling is set too low. It's a field on the Create Device form (Unclassified → Top Secret), stored on the device — edit the device to raise it, then Reissue Credentials and redeploy. (A relay left at Unclassified caps all its traffic to Unclassified.) Otherwise, send within the floor. See Add a server → Step 2 and Security model → Classification.
A device just had its group key rotated — devices pick up the new epoch on their next background poll; a brief window can look like dropped traffic. | Trouble appears right after a key rotation and clears within a few minutes. | Wait for the next poll; the previous epoch stays valid through the bounded grace window. If it persists, see Tokens are rejected. See Rotate keys → If something goes wrong and wire protocol → Group-key rotation.
It's a duplicate / replay drop, not a loss — the receive pipeline rejects stale frames (outside the ±60 s window) and repeated (principal, nonce) pairs. | Only old or re-sent frames are missing; live traffic is fine. | Expected behaviour, not a fault. See Security model → Receive-side gates.
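
To check the first row (no server covers the user's cell), one option is to compute the geohash-5 cell for an affected user's position and compare it with the coverage list. The sketch below uses the standard public geohash encoding; the function names and the sample coverage list are hypothetical, and it assumes coverage_cells holds plain 5-character geohash strings, which this page implies but does not spell out.

```python
from typing import Iterable

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 5) -> str:
    """Standard geohash: interleave longitude/latitude bisection bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    value = 0
    refine_lon = True  # the first bit always refines longitude
    while len(chars) < precision:
        if refine_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        refine_lon = not refine_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits = 0
            value = 0
    return "".join(chars)


def is_covered(lat: float, lon: float, coverage_cells: Iterable[str]) -> bool:
    """True if the position's geohash-5 cell appears in the coverage list."""
    return geohash_encode(lat, lon, 5) in set(coverage_cells)


# Hypothetical coverage list, for illustration only.
coverage = ["u4pru", "u4prv"]
print(geohash_encode(57.64911, 10.40744))      # u4pru
print(is_covered(57.64911, 10.40744, coverage))
```

A position whose cell is missing from every server's list points at the coverage fix in the table, not at a client problem.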

A server won't federate

Two servers are up but won't link and share traffic — users on one relay can't see users on the other.

Likely cause | How to check | Fix
Peer certificate not trusted — federation is full mTLS; each server must trust the CA the other's certificate chains to. | One or both servers reject the peer's certificate at link time. | Make sure both ends share the same Root / intermediate. See Set up the certificate chain → If something goes wrong and wire protocol → Federation.
Peer CN doesn't match node_id — federation pins the peer cert CN to the node's id. | The peers connect at TLS but the link is dropped on identity mismatch. | Reissue the peer's certificate with the CN set to its node_id. See Security model → Transport TLS.
A server's ServerToken is missing or expired — the box can't authenticate itself into the mesh. | The server starts but rejects traffic, or won't carry some messages. | Reissue the ServerToken from the device's page (Reissue Credentials), redeploy all three files, and restart. See Add a server → If something goes wrong.
The box won't start at all — usually TLS certs or the encrypted data-directory mount. | The server process exits on boot; health check on 9090 never answers. | Check the TLS cert/key paths and the data-directory mount first. See Add a server → If something goes wrong.
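
The CN-pinning row can be checked by hand by reading the Common Name out of the peer's certificate and comparing it with the expected node_id. The sketch below works on the dictionary shape that Python's ssl.SSLSocket.getpeercert() returns; the helper names and the sample values are hypothetical, and an exact string comparison is an assumption about how the pin behaves, not a description of the server's code.

```python
from typing import Optional


def peer_common_name(peer_cert: dict) -> Optional[str]:
    """Extract commonName from a getpeercert()-style subject.

    The subject is a tuple of RDNs, each a tuple of (key, value) pairs.
    """
    for rdn in peer_cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def cn_matches_node_id(peer_cert: dict, node_id: str) -> bool:
    """Assumed pin: the CN must equal the node_id exactly."""
    return peer_common_name(peer_cert) == node_id


# Hypothetical certificate subject, for illustration only.
cert = {
    "subject": (
        (("organizationName", "Example Org"),),
        (("commonName", "relay-east-1"),),
    )
}
print(peer_common_name(cert))
print(cn_matches_node_id(cert, "relay-west-1"))
```

A mismatch here corresponds to the fix in the table: reissue the peer's certificate with the CN set to its node_id.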

Forwarded IdentityTokens are byte-identical across hops, so the Directory signature stays verifiable end-to-end — routers never re-sign identity. If federation is up but identity claims are being rejected across the link, that's a token problem, not a federation one — see the next section.


Tokens are rejected

A connection or message is refused at the identity gate — the token doesn't verify, has expired, or signs under the wrong key.

Likely cause | How to check | Fix
Token expired — expires_at_ms is in the past (gate 3). | The principal worked until a deadline, then stopped. Service devices that don't self-refresh need reissuing. | For a service device (server/gateway/web token), Reissue Credentials from its page; operators get fresh passes on next login. See Add a server → Step 3 and Security model → Receive-side gates.
Clock skew — issued_at_ms is outside the ±60 s replay window (gate 2). | Fresh frames from one host are rejected as stale/future while others work. | Fix the host's clock (NTP). The window is fixed at 60 s for fresh frames. See Security model → Receive-side gates.
Signing key rotated, key set stale — the Directory signed under a fresh key the verifier hasn't cached yet. | Rejections start right after a signing-key rotation; the old key is still served until its tokens expire. | This resolves as verifiers refresh the Directory key set. The Directory keeps the previous signing key served until its tokens expire — don't force-expire it early. See Rotate keys.
Group-key epoch retired — a device is still using a previous epoch past its grace cutoff. | Trouble persists well past a group-key rotation (beyond the 1–60 min grace). | The device must pick up the current epoch. See Rotate keys → If something goes wrong and wire protocol → Group-key rotation.
Wrong / missing service token on a box — the server or web client has no valid IdentityToken to authenticate to the Directory. | The web revocation cache can't prime (login fails closed); a server can't poll /api/group-key or /api/revoked-principals. | Reissue and redeploy the service IdentityToken. For the web client, mint a new one from Devices and re-upload. See Stand up the web client.
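
The timing-related rows above (expiry, clock skew) and the replay drop described under Messages aren't arriving can be told apart with a small model of the checks. This is a minimal sketch, not the server's receive pipeline: the function name, the returned reason strings, the order of the checks, and the inclusive window boundary are all assumptions. Only the field names expires_at_ms and issued_at_ms, the ±60 s window, and the (principal, nonce) dedup come from this page.

```python
from typing import Optional, Set, Tuple

REPLAY_WINDOW_MS = 60_000  # the ±60 s window described above


def check_frame(principal: str, nonce: str, issued_at_ms: int,
                expires_at_ms: int, now_ms: int,
                seen: Set[Tuple[str, str]]) -> Optional[str]:
    """Return a rejection reason, or None if the frame passes.

    `seen` holds (principal, nonce) pairs already accepted; it is
    updated in place when a frame passes.
    """
    # Clock skew / stale frame: issued too far from the receiver's clock.
    if abs(now_ms - issued_at_ms) > REPLAY_WINDOW_MS:
        return "outside replay window"
    # Token expiry: the deadline is already in the past.
    if expires_at_ms <= now_ms:
        return "token expired"
    # Replay: same (principal, nonce) pair seen before.
    if (principal, nonce) in seen:
        return "replayed nonce"
    seen.add((principal, nonce))
    return None


now = 1_700_000_000_000
seen: Set[Tuple[str, str]] = set()
print(check_frame("alice", "n1", now - 1_000, now + 5_000, now, seen))
print(check_frame("alice", "n1", now - 1_000, now + 5_000, now, seen))
print(check_frame("bob", "n2", now - 61_000, now + 5_000, now, seen))
```

A host whose frames all land in "outside replay window" while other hosts pass is the clock-skew case; a principal that passed until a fixed moment and then fails with "token expired" is the expiry case.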

The Directory's public signing key is served at /api/.well-known/directory-key — confirming that endpoint returns the expected key is the first check when token verification fails broadly. (See Stand up the Directory → How to know it worked.)


A device or operator can't be revoked

You're trying to cut off access and the revoke action isn't available, or the revocation doesn't seem to take effect.

Likely cause | How to check | Fix
The Revoke button is greyed out — the system won't let you revoke the last active principal or device, to prevent a full lockout. | The button is disabled on the page. | Add another active principal or device first, then revoke this one. See Revoke a principal.
The Revoke button is missing entirely — either it's already revoked, or you're viewing as a non-admin. | Check for a red revoked badge; if it's not there, your login may lack the Admin role. | If already revoked, you're done. Otherwise ask an Admin — only the Admin role sees the Revoke control. See Revoke a principal → If something goes wrong.
You're looking under the wrong menu — people are under Principals, devices are under Devices (both top-level). | You can't find the record because you're in the wrong list. | Principals and Devices are separate top-level menu items. Pick the right one. See Revoke a principal and Onboard a device.
Revocation hasn't reached the servers yet — the change is instant in the Directory, but servers enforce it on their next poll of the revoked-principals feed. | The principal still appears connected for a short window after you revoke. | No action needed — within the revocation poll window every server refuses the tokens, and the server force-closes any active session on a revoked-list update. See Revoke a principal → How to know it worked and Security model → Revocation.
You need to revoke only one device, not the operator — a device key is compromised but the person stays active. | You want to keep the operator and kill one device. | The RevocationList carries both levels — revoke the device (under Devices) to drop that device's sign-key without touching the operator. See Security model → Revocation.
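
The two revocation levels in the last row, and the session force-close in the row before it, can be pictured with a small data-structure sketch. Everything here is illustrative: the class layout, the field names principals and device_sign_keys, and the session tuples are hypothetical stand-ins, not the actual RevocationList format.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class RevocationList:
    """Hypothetical two-level list: whole principals, or single device keys."""
    principals: Set[str] = field(default_factory=set)
    device_sign_keys: Set[str] = field(default_factory=set)

    def refuses(self, principal_id: str, device_sign_key: str) -> bool:
        # A hit at either level is enough to refuse the token.
        return (principal_id in self.principals
                or device_sign_key in self.device_sign_keys)


def sessions_to_close(sessions: List[Tuple[str, str]],
                      revoked: RevocationList) -> List[Tuple[str, str]]:
    """Sessions a server would drop after picking up an updated list."""
    return [s for s in sessions if revoked.refuses(s[0], s[1])]


# Hypothetical sessions as (principal_id, device_sign_key) pairs.
active = [("alice", "key-laptop"), ("alice", "key-phone"), ("bob", "key-tablet")]
revoked = RevocationList(device_sign_keys={"key-laptop"})
print(sessions_to_close(active, revoked))
```

Revoking only the device key drops that one session and leaves the operator's other devices connected, which is the outcome the last row describes.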

Revocation is one-way and permanent — there is no un-revoke. If you revoke the wrong record, the only path back is to onboard them again from scratch (a fresh profile, re-register their security key). See Revoke a principal.




Verified against directory@e8287cd / server@ab688f0 / web@80e3ec2 on 2026-06-07 — symptoms consolidated from the runbooks' "If something goes wrong" sections; cross-cutting mechanism cross-checked against security/model.md (receive-side gates, revocation two levels + session force-close server/src/audit.rs, classification PEP server/src/classification_gate.rs) and protocol/wire-protocol.md (cell-coverage ACL, server-cert TLS / federation mTLS CN-pinning server/src/active_peers.rs, group-key grace window).