We're applying a planned hardware upgrade to the primary application tier. The site is temporarily offline. This page is generated from the active runbook by our operations tooling — every line of it reflects the current state of the upgrade.
app-01..app-06 · 6.6.32-lts
| Host | Stage | Started | Status |
|---|---|---|---|
| app-01 | boot · verify · re-enable | 02:01 UTC | healthy |
| app-02 | boot · verify · re-enable | 02:34 UTC | healthy |
| app-03 | install firmware | 03:08 UTC | in progress |
| app-04 | queued | — | queued |
| app-05 | queued | — | queued |
| app-06 | queued | — | queued |
/healthz 200 OK for three consecutive ticks (15s apart)
The web frontend, API and admin tools return 503 for the duration of the work. Background jobs and email are queued and will process automatically once the cluster is healthy. SSO logins federate to a read-only follower so existing tokens keep working for read-only operations.
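The health criterion quoted above (three consecutive `/healthz` 200 responses, 15 seconds apart) can be sketched as a small polling gate. This is an illustration only, not the runbook's actual code: the function name `wait_until_healthy`, the injected `probe` callable, and the `max_ticks` limit are assumptions made for the example.

```python
import time
from typing import Callable

REQUIRED_CONSECUTIVE = 3  # from the page: three consecutive ticks
TICK_SECONDS = 15         # from the page: ticks are 15s apart


def wait_until_healthy(
    probe: Callable[[], int],
    sleep: Callable[[float], None] = time.sleep,
    max_ticks: int = 40,  # hypothetical upper bound, not stated on the page
) -> bool:
    """Poll `probe` until it returns 200 on three consecutive ticks.

    `probe` is a hypothetical callable returning the HTTP status of /healthz.
    Any non-200 status resets the streak to zero.
    """
    streak = 0
    for _ in range(max_ticks):
        streak = streak + 1 if probe() == 200 else 0
        if streak >= REQUIRED_CONSECUTIVE:
            return True
        sleep(TICK_SECONDS)
    return False
```

Injecting `probe` and `sleep` keeps the gate testable without a network or real waiting; a real probe would issue an HTTP GET against the host's `/healthz` endpoint.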
If smoke tests fail or replication lag exceeds 30 seconds at the end of a host upgrade, the runbook executes:
$ ./rollback.sh --host app-03 --to revision app-03-pre-2026-q2-04
→ boot previous kernel from grub fallback entry
→ restore firmware to cobaltspire-s12-rev-b.bin
→ rejoin host to cluster as a follower
→ alert on-call ops, page secondary on-call
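The rollback trigger described above reduces to a two-condition check at the end of each host upgrade. The sketch below is a hedged illustration: `HostResult` and `should_roll_back` are hypothetical names, and only the 30-second threshold and the two conditions come from this page.

```python
from dataclasses import dataclass

MAX_REPLICATION_LAG_S = 30.0  # from the page: roll back if lag exceeds 30 seconds


@dataclass(frozen=True)
class HostResult:
    """Hypothetical summary of one host's state at the end of its upgrade."""
    host: str
    smoke_tests_passed: bool
    replication_lag_s: float


def should_roll_back(result: HostResult) -> bool:
    """True if smoke tests failed or replication lag exceeds the threshold."""
    return (
        not result.smoke_tests_passed
        or result.replication_lag_s > MAX_REPLICATION_LAG_S
    )
```

With the figures currently shown on this page (postgres lag of 0.4 s), `should_roll_back(HostResult("app-03", True, 0.4))` evaluates to `False`. A lag of exactly 30 seconds does not trigger a rollback here, since the page says "exceeds".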
| Check | Latest | Trend |
|---|---|---|
| healthz (app-01) | 200 | steady |
| healthz (app-02) | 200 | steady |
| healthz (app-03) | no route | expected (rebooting) |
| postgres lag | 0.4 s | steady |
| NATS round-trip | 1.6 ms | steady |
| tls handshake p95 | 42 ms | improving |
| cdn hit ratio | 0.94 | steady |
The new storage generation requires firmware that only takes effect on a cold reboot. Our maintenance window for this region is 02:00–04:00 UTC, which is the early hours for European customers and late evening for North American ones — the lowest-traffic slot we can pick.
Drafts are not lost. They autosave every 90 seconds to an independent durable store that is decoupled from the main API tier, so they'll be there when you sign back in.
Exports queued before the window will start once the cluster is healthy and will email you when ready. Exports queued during the window are accepted by the API gateway (which returns a job ID) and run after the window, once the cluster is healthy.
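The accept-now, run-later behaviour for exports can be pictured as a simple deferred queue. This is a minimal sketch under stated assumptions, not the gateway's real implementation: the class `DeferredExportQueue`, its methods, and the use of a random hex string as the job ID are all hypothetical.

```python
import uuid
from collections import deque
from typing import Callable, Deque, List, Tuple


class DeferredExportQueue:
    """Accepts export requests immediately and runs them once the cluster is healthy."""

    def __init__(self) -> None:
        self._pending: Deque[Tuple[str, str]] = deque()

    def accept(self, export_request: str) -> str:
        """Queue the request and return a job ID right away (ID format is an assumption)."""
        job_id = uuid.uuid4().hex
        self._pending.append((job_id, export_request))
        return job_id

    def drain(self, cluster_healthy: bool, run: Callable[[str, str], None]) -> List[str]:
        """Run queued jobs in arrival order, but only when the cluster is healthy."""
        if not cluster_healthy:
            return []
        completed: List[str] = []
        while self._pending:
            job_id, request = self._pending.popleft()
            run(job_id, request)
            completed.append(job_id)
        return completed
```

Calling `drain(False, ...)` during the window leaves every job queued; the first call with `cluster_healthy=True` processes them in the order they were accepted.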
Inbound webhooks return 503 during the window, and your systems are expected to retry them. Outbound webhooks pause and are resent with exponential backoff once we're up; we retry endpoints returning 5xx for up to 24 hours.
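To make the outbound retry policy concrete, here is a sketch of an exponential backoff schedule bounded by the 24-hour retry window. Only the 24-hour limit comes from this page; the 60-second base delay, the doubling factor, and the one-hour cap on individual delays are assumptions chosen for illustration.

```python
from typing import Iterator

RETRY_WINDOW_S = 24 * 60 * 60  # from the page: retry for up to 24 hours


def backoff_schedule(
    base_s: float = 60.0,    # assumed first delay
    factor: float = 2.0,     # assumed growth factor
    cap_s: float = 3600.0,   # assumed maximum single delay
    window_s: float = RETRY_WINDOW_S,
) -> Iterator[float]:
    """Yield the elapsed time (in seconds) at which each retry is attempted.

    Delays grow geometrically up to `cap_s`; no retry is scheduled
    past `window_s`.
    """
    elapsed = 0.0
    delay = base_s
    while elapsed + delay <= window_s:
        elapsed += delay
        yield elapsed
        delay = min(delay * factor, cap_s)
```

Under these assumed parameters the first three retries fall at 60, 180 and 420 seconds after the first failed delivery. Capping the delay keeps retries reasonably frequent late in the window instead of letting the gap grow to many hours.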