Server upgrade in progress

CHG-2026-Q2-04 runbook · ops-2026-Q2-04 · rev 7

We're applying a planned hardware upgrade to the primary application tier. The site is temporarily offline. This page is generated from the active runbook by our operations tooling — every line of it reflects the current state of the upgrade.

6 hosts in scope · 2 hosts done · 1 host running · ~65m eta remaining

What we're doing

Host matrix

Host     Stage                       Started     Status
app-01   boot · verify · re-enable   02:01 UTC   healthy
app-02   boot · verify · re-enable   02:34 UTC   healthy
app-03   install firmware            03:08 UTC   in progress
app-04   queued                      -           queued
app-05   queued                      -           queued
app-06   queued                      -           queued

Step-by-step plan (per host)

  1. drain workloads via load balancer, wait for connections < 5
  2. snapshot persistent volumes, verify hash on storage controller
  3. install firmware blob (cobaltspire-x22-rev-c.bin)
  4. swap kernel image to 6.6.32-lts, regenerate initramfs
  5. reboot host, wait for SSH on management network
  6. verify /healthz 200 OK for three consecutive ticks (15s apart)
  7. run smoke tests: postgres replication lag, NATS round-trip, TLS handshake
  8. re-enable traffic at the load balancer, ramp 10% → 100% over 6 minutes
  9. watch dashboards for 10 minutes, alert on regression
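
Steps 6 and 8 above can be sketched in a few lines. This is a minimal illustration, not the runbook's actual tooling: the function names, the `max_ticks` timeout, and the linear shape of the ramp are assumptions. Only the "three consecutive ticks, 15s apart" and "10% → 100% over 6 minutes" figures come from the plan.

```python
from typing import Callable, List


def healthy_for_consecutive_ticks(
    probe: Callable[[], int],
    sleep: Callable[[float], None],
    required: int = 3,        # step 6: three consecutive ticks
    interval_s: float = 15.0,  # step 6: ticks are 15s apart
    max_ticks: int = 20,       # assumption: the plan states no timeout
) -> bool:
    """Return True once `probe` has returned 200 on `required` ticks in a row."""
    streak = 0
    for tick in range(max_ticks):
        if tick > 0:
            sleep(interval_s)  # wait between ticks, not before the first one
        # Any non-200 response resets the streak to zero.
        streak = streak + 1 if probe() == 200 else 0
        if streak >= required:
            return True
    return False


def ramp_schedule(start_pct: int = 10, end_pct: int = 100,
                  duration_min: int = 6) -> List[int]:
    """Traffic percentage at each minute mark (assumed linear ramp)."""
    span = end_pct - start_pct
    return [round(start_pct + span * i / duration_min)
            for i in range(duration_min + 1)]
```

In real use, `probe` would issue the GET against `/healthz` and `sleep` would be `time.sleep`; passing both in as parameters keeps the logic testable without a network or a clock.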

Customer impact

The web frontend, API and admin tools return 503 for the duration of the work. Background jobs and email are queued and will process automatically once the cluster is healthy. SSO logins federate to a read-only follower so existing tokens keep working for read-only operations.

Estimated time to recovery: 60–90 minutes. We do not expect any data loss; each host's persistent volumes are snapshotted and verified by the storage controller before its firmware is changed.

Rollback plan

If smoke tests fail or replication lag exceeds 30 seconds at the end of a host upgrade, the runbook executes:

$ ./rollback.sh --host app-03 --to revision app-03-pre-2026-q2-04
→ boot previous kernel from grub fallback entry
→ restore firmware to cobaltspire-s12-rev-b.bin
→ rejoin host to cluster as a follower
→ alert on-call ops, page secondary on-call

A full rollback restores the cluster to the previous generation within ~12 minutes per host. Customer data is never at risk during rollback; only the new firmware and kernel are reverted.
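
The rollback trigger described above reduces to a small predicate. The sketch below is illustrative only: `HostCheck` and `should_roll_back` are hypothetical names, not part of `rollback.sh`. The 30-second threshold is the one stated in this section, and "exceeds" is read as strictly greater than.

```python
from dataclasses import dataclass

MAX_REPLICATION_LAG_S = 30.0  # threshold stated in the rollback plan


@dataclass(frozen=True)
class HostCheck:
    """Results gathered at the end of one host upgrade (hypothetical shape)."""
    smoke_tests_passed: bool
    replication_lag_s: float


def should_roll_back(check: HostCheck) -> bool:
    """True if smoke tests failed or replication lag exceeds the threshold."""
    return (not check.smoke_tests_passed
            or check.replication_lag_s > MAX_REPLICATION_LAG_S)
```

For example, a host with passing smoke tests and the 0.4 s lag reported under Status pings would not trigger a rollback, while a failed smoke test would trigger one regardless of lag.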

Status pings

Check                Latest     Trend
healthz (app-01)     200        steady
healthz (app-02)     200        steady
healthz (app-03)     no route   expected (rebooting)
postgres lag         0.4 s      steady
NATS round-trip      1.6 ms     steady
tls handshake p95    42 ms      improving
cdn hit ratio        0.94       steady

FAQ

Why is this happening during the day?

The new storage generation requires firmware that only takes effect on a cold reboot. Our maintenance window for this region is 02:00–04:00 UTC, which is the early hours for European customers and late evening for North American ones — the lowest-traffic slot we can pick.

Will I lose draft data?

No. Drafts autosave every 90 seconds to an independent durable store and are decoupled from the main API tier. They'll be there when you sign back in.

What about queued exports?

Exports queued before the window will start once the cluster is healthy and will email you when ready. Exports queued during the window are accepted by the API gateway (which returns a job ID) and queued for execution after.

Are integrations affected?

Inbound webhooks return 503 during the window; your side is expected to retry them. Outbound webhooks pause and are resent with exponential backoff once we're back up; we retry endpoints returning 5xx for up to 24 hours.
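
To make the retry behaviour concrete, here is a rough sketch of an exponential backoff schedule bounded by the 24-hour window. Only the 24-hour limit comes from the answer above. The 60-second base delay, the doubling factor, and the one-hour cap are illustrative assumptions, not our actual settings. The same shape works for retrying inbound webhooks from your side.

```python
from typing import List

RETRY_WINDOW_S = 24 * 60 * 60  # from the FAQ: retries continue for up to 24 hours


def backoff_delays(base_s: float = 60.0,        # assumption: first retry after 60s
                   factor: float = 2.0,         # assumption: delay doubles each time
                   max_delay_s: float = 3600.0,  # assumption: delays capped at 1 hour
                   window_s: float = RETRY_WINDOW_S) -> List[float]:
    """Delays between retries, stopping before the total exceeds the window."""
    delays: List[float] = []
    elapsed = 0.0
    delay = base_s
    while elapsed + delay <= window_s:
        delays.append(delay)
        elapsed += delay
        # Grow the delay geometrically, but never past the cap.
        delay = min(delay * factor, max_delay_s)
    return delays
```

With these assumed defaults the first delays are 60, 120, 240 and 480 seconds, and the schedule settles at one retry per hour until the window closes. Production senders usually add random jitter to each delay so that many endpoints are not retried at the same instant.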

Contact

On-call · ops east: ops-east@lankworth-tlm.io · paged
On-call · ops west: ops-west@lankworth-tlm.io · backup
Status page: status.lankworth-tlm.io
Customer success: cs@lankworth-tlm.io · best-effort

Site Reliability · Lankworth Telematics
page · sre-pager-east · rev. 7