Troubleshooting
A cross-cutting, symptom-first guide to the things that actually go wrong on a running deployment. The individual runbooks each carry an "If something goes wrong" section for their own step; this page collects the failures that span steps — where the cause is in one component and the symptom shows up in another — so you can start from what you're seeing and work back to the fix.
For each symptom: the likely cause, how to check, and the fix (with a link to the runbook or reference that owns the detail).
A client can't connect
The user opens the app or the web client and never reaches the map — a TLS warning, a login error, or a silent hang.
| Likely cause | How to check | Fix |
|---|---|---|
| Certificate not trusted — the client doesn't have the Root CA, or the server presents only its leaf, not the chain. | Connecting shows an "untrusted" / certificate warning before any login screen. | Distribute the Root CA to clients and make sure the server presents leaf + intermediate (not just the leaf). See Set up the certificate chain → If something goes wrong. |
| Hostname mismatch — the server's certificate SAN doesn't list the name the client dialed. | A hostname-mismatch TLS error naming the host. | Reissue the server leaf with the right hostname in the SAN. See Set up the certificate chain. |
| DNS hasn't propagated / Caddy hasn't issued TLS — the site or the Directory is unreachable. | dig <hostname> doesn't resolve to the static IP from tofu output. | Add the A record and wait for propagation; Caddy can't get a Let's Encrypt cert until DNS resolves. See Stand up the Directory and Stand up the web client. |
| OIDC login fails (web) — wrong issuer URL, or the OIDC client secret / web service token wasn't uploaded. | Login redirect loops, or fails closed; the bedrock-<env>-webauth-client-secret secret is missing, or the revocation cache can't prime. | Check WAYPOINT_OIDC_ISSUER_URL / directory_url points at the Directory, and that both the OIDC client secret and the web service token are uploaded. See Stand up the web client → If something goes wrong. |
| User isn't onboarded — "user not found" / access denied. | Login reaches the Directory but is rejected for that person. | Onboard them first. See Onboard an operator. |
| The user was revoked — their tokens are now refused everywhere. | Their Directory page shows a red revoked badge. | Revocation is permanent. If it was a mistake, onboard them again from scratch. See Revoke a principal and A device or operator can't be revoked below. |
The authoritative gate on the client→router hop is server-cert TLS plus app-layer envelope verification; the client presents no client certificate. If TLS is clean but messages still don't flow, the problem lies beyond the connection itself: see the next two sections. (See Security model → Transport TLS.)
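The DNS and hostname-mismatch rows above can be triaged from a workstation before touching the server. The following is a minimal sketch using only the Python standard library. `check_dns` and `name_in_san` are hypothetical helper names, the expected IP is whatever tofu output reports for your deployment, and the wildcard rule is the usual single-label match rather than anything specific to this project.

```python
import socket

def check_dns(hostname, expected_ip, resolve=socket.getaddrinfo):
    """Return True if hostname resolves to expected_ip (the static IP from tofu output)."""
    try:
        infos = resolve(hostname, None)
    except socket.gaierror:
        return False  # the name doesn't resolve at all: DNS hasn't propagated yet
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return expected_ip in {info[4][0] for info in infos}

def name_in_san(hostname, san_dns_names):
    """Return True if hostname is covered by one of the certificate's DNS SAN entries."""
    host = hostname.lower().rstrip(".")
    for entry in san_dns_names:
        entry = entry.lower().rstrip(".")
        if entry == host:
            return True
        # A wildcard covers exactly one left-most label:
        # *.example.org matches a.example.org but not a.b.example.org.
        if entry.startswith("*.") and "." in host:
            if host.split(".", 1)[1] == entry[2:]:
                return True
    return False
```

If `check_dns` fails, fix DNS first, since Caddy can't obtain a certificate until the name resolves. If DNS is fine but `name_in_san` is false for the name the client dialed, the leaf needs reissuing.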
Messages aren't arriving
The client connects and the map loads, but positions, chat, or drawings from other users never show up — or only some do.
| Likely cause | How to check | Fix |
|---|---|---|
| No server covers the user's cell — routing is cell-first; a router only relays traffic for the geohash-5 cells in its ServerToken.coverage_cells, and denies the rest by ACL. | Affected users are operating in an area whose cell isn't in any server's coverage list. The web map is empty / no other users appear. | Add the missing cell(s) to the server's coverage and reissue its ServerToken, then restart the box. See Add a server → How geofencing works and the wire protocol key namespace. |
| The web client has no relay configured — zenoh_router_endpoints is empty, so the browser fails closed with no map data. | The web zenoh_router_endpoints Terraform variable is blank or points at the wrong WSS address. | Set it to a live server's WSS endpoint (e.g. wss://…:7447) and re-apply. See Stand up the web client → Server / relay. |
| Classification floor is dropping the message — the publish-side gate denies any payload above min(sender ceiling, server ceiling). | Traffic at a higher classification than the server's ceiling silently doesn't relay; the server audit shows a classification denial. | The relay's Classification ceiling is set too low. It's a field on the Create Device form (Unclassified → Top Secret), stored on the device — edit the device to raise it, then Reissue Credentials and redeploy. (A relay left at Unclassified caps all its traffic to Unclassified.) Otherwise, send within the floor. See Add a server → Step 2 and Security model → Classification. |
| A device just had its group key rotated — devices pick up the new epoch on their next background poll; a brief window can look like dropped traffic. | Trouble appears right after a key rotation and clears within a few minutes. | Wait for the next poll; the previous epoch stays valid through the bounded grace window. If it persists, see Tokens are rejected. See Rotate keys → If something goes wrong and wire protocol → Group-key rotation. |
| It's a duplicate / replay drop, not a loss — the receive pipeline rejects stale frames (outside the ±60 s window) and repeated (principal, nonce) pairs. | Only old or re-sent frames are missing; live traffic is fine. | Expected behaviour, not a fault. See Security model → Receive-side gates. |
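To tell a coverage gap from a classification denial, it can help to work out which geohash-5 cell a user is in and compare it with the server's coverage list. The sketch below uses the standard public geohash encoding. The function names are hypothetical, and classification levels are passed as plain integer ranks (0 for Unclassified, higher for more restrictive) because the full list of levels isn't given here. This illustrates the two checks described in the table; it is not the server's actual code.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash(lat, lon, precision=5):
    """Encode a latitude/longitude pair as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, value, use_lon = [], 0, 0, True
    while len(chars) < precision:
        # Bits alternate between longitude and latitude, longitude first.
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value, lon_lo = (value << 1) | 1, mid
            else:
                value, lon_hi = value << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value, lat_lo = (value << 1) | 1, mid
            else:
                value, lat_hi = value << 1, mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

def cell_covered(lat, lon, coverage_cells):
    """True if the position's geohash-5 cell is in the server's coverage list."""
    return geohash(lat, lon, 5) in coverage_cells

def classification_allows(payload_rank, sender_ceiling, server_ceiling):
    """True if the payload is at or below min(sender ceiling, server ceiling)."""
    return payload_rank <= min(sender_ceiling, server_ceiling)
```

For example, `cell_covered(42.6, -5.6, {"ezs42"})` is true, while the same position against a coverage list that lacks `ezs42` is not, which corresponds to the first row of the table.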
A server won't federate
Two servers are up but won't link and share traffic — users on one relay can't see users on the other.
| Likely cause | How to check | Fix |
|---|---|---|
| Peer certificate not trusted — federation is full mTLS; each server must trust the CA the other's certificate chains to. | One or both servers reject the peer's certificate at link time. | Make sure both ends share the same Root / intermediate. See Set up the certificate chain → If something goes wrong and wire protocol → Federation. |
| Peer CN doesn't match node_id — federation pins the peer cert CN to the node's id. | The peers connect at TLS but the link is dropped on identity mismatch. | Reissue the peer's certificate with the CN set to its node_id. See Security model → Transport TLS. |
| A server's ServerToken is missing or expired — the box can't authenticate itself into the mesh. | The server starts but rejects traffic, or won't carry some messages. | Reissue the ServerToken from the device's page (Reissue Credentials), redeploy all three files, and restart. See Add a server → If something goes wrong. |
| The box won't start at all — usually TLS certs or the encrypted data-directory mount. | The server process exits on boot; health check on 9090 never answers. | Check the TLS cert/key paths and the data-directory mount first. See Add a server → If something goes wrong. |
Forwarded IdentityTokens are byte-identical across hops, so the Directory signature stays
verifiable end-to-end — routers never re-sign identity. If federation is up but identity
claims are being rejected across the link, that's a token problem, not a federation one — see
the next section.
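Before digging into certificates or tokens, it is worth ruling out the "box won't start at all" row from the table above. The check there (the health check on 9090 never answers) can be approximated with a plain TCP probe. This is a minimal sketch using the standard library; `port_answers` is a hypothetical helper, and it only confirms that something is listening on the port, not that the health endpoint reports healthy, since the health protocol isn't described on this page.

```python
import socket

def port_answers(host, port=9090, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

If the probe fails on one peer, treat it as a startup problem on that box rather than a federation problem between the two.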
Tokens are rejected
A connection or message is refused at the identity gate — the token doesn't verify, has expired, or signs under the wrong key.
| Likely cause | How to check | Fix |
|---|---|---|
| Token expired — expires_at_ms is in the past (gate 3). | The principal worked until a deadline, then stopped. Service devices that don't self-refresh need reissuing. | For a service device (server/gateway/web token) Reissue Credentials from its page; operators get fresh passes on next login. See Add a server → Step 3 and Security model → Receive-side gates. |
| Clock skew — issued_at_ms is outside the ±60 s replay window (gate 2). | Fresh frames from one host are rejected as stale/future while others work. | Fix the host's clock (NTP). The window is fixed at 60 s for fresh frames. See Security model → Receive-side gates. |
| Signing key rotated, key set stale — the Directory signed under a fresh key the verifier hasn't cached yet. | Rejections start right after a signing-key rotation; the old key is still served until its tokens expire. | This resolves as verifiers refresh the Directory key set. The Directory keeps the previous signing key served until its tokens expire — don't force-expire it early. See Rotate keys. |
| Group-key epoch retired — a device is still using a previous epoch past its grace cutoff. | Trouble persists well past a group-key rotation (beyond the 1–60 min grace). | The device must pick up the current epoch. See Rotate keys → If something goes wrong and wire protocol → Group-key rotation. |
| Wrong / missing service token on a box — the server or web client has no valid IdentityToken to authenticate to the Directory. | The web revocation cache can't prime (login fails closed); a server can't poll /api/group-key or /api/revoked-principals. | Reissue and redeploy the service IdentityToken. For the web client, mint a new one from Devices and re-upload. See Stand up the web client. |
The Directory's public signing key is served at /api/.well-known/directory-key — confirming
that endpoint returns the expected key is the first check when token verification fails broadly.
(See Stand up the Directory → How to know it worked.)
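The time-based rejections in the table (gate 2, gate 3, and the duplicate drop) can be reproduced offline to work out which one a given frame would hit. The sketch below assumes that issued_at_ms belongs to the frame and expires_at_ms to the token, which this page implies but doesn't state. `check_frame` and its parameters are hypothetical, and where the real pipeline places the (principal, nonce) check relative to the numbered gates isn't specified here.

```python
import time

REPLAY_WINDOW_MS = 60_000  # the ±60 s window for fresh frames

def check_frame(frame_issued_at_ms, token_expires_at_ms, principal, nonce,
                seen, now_ms=None):
    """Return None if the frame passes, else a short reason string.

    `seen` is a set of (principal, nonce) pairs already accepted.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Gate 2: a frame timestamp too far from local time, in either direction.
    if abs(now_ms - frame_issued_at_ms) > REPLAY_WINDOW_MS:
        return "gate 2: stale or future frame, check the sender's clock"
    # Gate 3: the token's expiry is in the past.
    if token_expires_at_ms < now_ms:
        return "gate 3: token expired, reissue or log in again"
    # Replay drop: the same (principal, nonce) pair was already accepted.
    if (principal, nonce) in seen:
        return "duplicate: replay drop, expected behaviour"
    seen.add((principal, nonce))
    return None
```

Feeding it the timestamps from a rejected frame, together with the verifier host's clock as `now_ms`, shows whether the rejection points at clock skew (fix NTP) or at expiry (reissue credentials).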
A device or operator can't be revoked
You're trying to cut off access and the revoke action isn't available, or the revocation doesn't seem to take effect.
| Likely cause | How to check | Fix |
|---|---|---|
| The Revoke button is greyed out — the system won't let you revoke the last active principal or device, to prevent a full lockout. | The button is disabled on the page. | Add another active principal or device first, then revoke this one. See Revoke a principal. |
| The Revoke button is missing entirely — either it's already revoked, or you're viewing as a non-admin. | Check for a red revoked badge; if it's not there, your login may lack the Admin role. | If already revoked, you're done. Otherwise ask an Admin — only the Admin role sees the Revoke control. See Revoke a principal → If something goes wrong. |
| You're looking under the wrong menu — people are under Principals, devices are under Devices (both top-level). | You can't find the record because you're in the wrong list. | Principals and Devices are separate top-level menu items. Pick the right one. See Revoke a principal and Onboard a device. |
| Revocation hasn't reached the servers yet — the change is instant in the Directory, but servers enforce it on their next poll of the revoked-principals feed. | The principal still appears connected for a short window after you revoke. | No action needed — within the revocation poll window every server refuses the tokens, and the server force-closes any active session on a revoked-list update. See Revoke a principal → How to know it worked and Security model → Revocation. |
| You need to revoke only one device, not the operator — a device key is compromised but the person stays active. | You want to keep the operator and kill one device. | The RevocationList carries both levels — revoke the device (under Devices) to drop that device's sign-key without touching the operator. See Security model → Revocation. |
Revocation is one-way and permanent — there is no un-revoke. If you revoke the wrong record, the only path back is to onboard them again from scratch (a fresh profile, re-register their security key). See Revoke a principal.
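The two revocation levels described above can be pictured as two independent sets, either of which is enough to refuse a token. This is an illustrative sketch only: the `RevocationList` class, its field names, and `is_refused` are hypothetical and don't reflect the real feed format or the server's data structures.

```python
from dataclasses import dataclass, field

@dataclass
class RevocationList:
    # Hypothetical shape: one set per revocation level.
    principals: set = field(default_factory=set)   # revoked operators
    device_keys: set = field(default_factory=set)  # revoked device sign-keys

def is_refused(revoked, principal_id, device_key_id):
    """A token is refused if either its principal or its device key is revoked."""
    return principal_id in revoked.principals or device_key_id in revoked.device_keys
```

Revoking only a device adds its key to `device_keys`, so the operator's other devices keep working. Revoking the operator adds them to `principals`, which refuses every device they hold.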
See also
- Before you begin — prerequisites, roles, and the end-to-end checklist.
- Stand up a deployment — the deployment order and the four "is it up?" checks.
- Security model — the receive-side gates, revocation, and classification.
- Wire protocol — the cell-first namespace, transport, and group-key rotation.
- Operator training index — every runbook's own "If something goes wrong" section.
Verified against directory@e8287cd / server@ab688f0 / web@80e3ec2 on 2026-06-07 — symptoms consolidated from the runbooks' "If something goes wrong" sections; cross-cutting mechanism cross-checked against security/model.md (receive-side gates, revocation two levels + session force-close server/src/audit.rs, classification PEP server/src/classification_gate.rs) and protocol/wire-protocol.md (cell-coverage ACL, server-cert TLS / federation mTLS CN-pinning server/src/active_peers.rs, group-key grace window).