Server upgrade in progress

CHG-2026-Q2-04 runbook · ops-2026-Q2-04 · rev 7

We're applying a planned hardware upgrade to the primary application tier. The site is temporarily offline. This page is generated from the active runbook by our operations tooling — every line of it reflects the current state of the upgrade.

6 hosts in scope

2 hosts done

1 host running

~65m ETA remaining

What we're doing

Replacing NVMe storage on hosts app-01..app-06
Rolling kernel update to 6.6.32-lts
Switching network uplinks from 10G to 25G
Re-cabling top-of-rack switches (rack B-3, B-4)
Rotating the kubelet identity secret in the dataplane

Host matrix

Host	Stage	Started	Status
`app-01`	boot · verify · re-enable	02:01 UTC	healthy
`app-02`	boot · verify · re-enable	02:34 UTC	healthy
`app-03`	install firmware	03:08 UTC	in progress
`app-04`	queued	—	queued
`app-05`	queued	—	queued
`app-06`	queued	—	queued
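The summary counts for this page can be derived from the host matrix above. The following is a minimal sketch of that derivation, not the operations tooling itself: the `HOST_MATRIX` structure, the `summarize` function and its key names are hypothetical, and only the row values come from the table.

```python
from collections import Counter

# Rows copied from the host matrix above: (host, stage, started, status).
# The tuple layout is an assumption made for this sketch.
HOST_MATRIX = [
    ("app-01", "boot · verify · re-enable", "02:01 UTC", "healthy"),
    ("app-02", "boot · verify · re-enable", "02:34 UTC", "healthy"),
    ("app-03", "install firmware", "03:08 UTC", "in progress"),
    ("app-04", "queued", None, "queued"),
    ("app-05", "queued", None, "queued"),
    ("app-06", "queued", None, "queued"),
]


def summarize(rows):
    """Count hosts by status. The key names are hypothetical."""
    counts = Counter(status for _host, _stage, _started, status in rows)
    return {
        "in_scope": len(rows),
        "done": counts["healthy"],
        "running": counts["in progress"],
        "queued": counts["queued"],
    }


print(summarize(HOST_MATRIX))
# {'in_scope': 6, 'done': 2, 'running': 1, 'queued': 3}
```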

Step-by-step plan (per host)

drain workloads via load balancer, wait for connections < 5
snapshot persistent volumes, verify hash on storage controller
install firmware blob (cobaltspire-x22-rev-c.bin)
swap kernel image to 6.6.32-lts, regenerate initramfs
reboot host, wait for SSH on management network
verify /healthz 200 OK for three consecutive ticks (15s apart)
run smoke tests: postgres replication lag, NATS round-trip, TLS handshake
re-enable traffic at the load balancer, ramp 10% → 100% over 6 minutes
watch dashboards for 10 minutes, alert on regression
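Two of the steps above are easy to misread, so here is a rough sketch of how they might be expressed. It assumes that a non-200 response resets the count of consecutive healthy ticks and that the traffic ramp is linear in one-minute steps; the runbook states neither. The function names, the injectable `probe` and `sleep` callables, and the `max_ticks` safety limit are all hypothetical.

```python
import time


def wait_until_healthy(probe, required=3, interval_s=15.0, max_ticks=40,
                       sleep=time.sleep):
    """Return True once probe() yields 200 on `required` consecutive ticks.

    Any other result resets the streak (assumed behaviour). `max_ticks` is
    an assumed safety limit and does not come from the runbook.
    """
    streak = 0
    for tick in range(max_ticks):
        if tick:
            sleep(interval_s)  # ticks are 15 s apart by default
        streak = streak + 1 if probe() == 200 else 0
        if streak >= required:
            return True
    return False


def ramp_schedule(start_pct=10, end_pct=100, minutes=6):
    """Return (minute, percent) pairs for a linear ramp (assumed shape)."""
    step = (end_pct - start_pct) / minutes
    return [(m, round(start_pct + step * m)) for m in range(minutes + 1)]


print(ramp_schedule())
# [(0, 10), (1, 25), (2, 40), (3, 55), (4, 70), (5, 85), (6, 100)]
```

Passing `probe` and `sleep` as parameters keeps the gate testable without a live host or real waiting; an actual probe would issue the request to `/healthz` and return the status code.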

Customer impact

The web frontend, API and admin tools return 503 for the duration of the work. Background jobs and email are queued and will process automatically once the cluster is healthy. SSO logins federate to a read-only follower so existing tokens keep working for read-only operations.

Estimated time to recovery: 60–90 minutes. We do not expect any data loss; snapshots have been taken for every host and verified by the storage controller.

Rollback plan

If smoke tests fail or replication lag exceeds 30 seconds at the end of a host upgrade, the runbook executes:

$ ./rollback.sh --host app-03 --to revision app-03-pre-2026-q2-04
→ boot previous kernel from grub fallback entry
→ restore firmware to cobaltspire-s12-rev-b.bin
→ rejoin host to cluster as a follower
→ alert on-call ops, page secondary on-call

A full rollback restores the cluster to the previous generation within ~12 minutes per host. Customer data is never at risk during rollback; only the new firmware and kernel are reverted.
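The rollback trigger described above reduces to a simple predicate. This is a sketch only: the `HostResult` type and `should_roll_back` function are hypothetical names, and "exceeds 30 seconds" is read here as strictly greater than 30.

```python
from dataclasses import dataclass

LAG_LIMIT_S = 30.0  # threshold taken from the rollback condition above


@dataclass
class HostResult:
    """Outcome of one host upgrade (hypothetical structure)."""
    host: str
    smoke_tests_passed: bool
    replication_lag_s: float


def should_roll_back(result, lag_limit_s=LAG_LIMIT_S):
    """True if smoke tests failed or replication lag exceeds the limit."""
    return (not result.smoke_tests_passed) or result.replication_lag_s > lag_limit_s


print(should_roll_back(HostResult("app-03", True, 0.4)))   # False
print(should_roll_back(HostResult("app-03", True, 31.0)))  # True
print(should_roll_back(HostResult("app-03", False, 0.4)))  # True
```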

Status pings

Check	Latest	Trend
healthz (app-01)	200	steady
healthz (app-02)	200	steady
healthz (app-03)	no route	expected (rebooting)
postgres lag	0.4 s	steady
NATS round-trip	1.6 ms	steady
tls handshake p95	42 ms	improving
cdn hit ratio	0.94	steady

FAQ

Why is this happening during the day?

The new storage generation requires firmware that only takes effect on a cold reboot. Our maintenance window for this region is 02:00–04:00 UTC, which is the early hours for European customers and late evening for North American ones — the lowest-traffic slot we can pick.

Will I lose draft data?

No. Drafts autosave every 90 seconds to an independent durable store and are decoupled from the main API tier. They'll be there when you sign back in.

What about queued exports?

Exports queued before the window will start once the cluster is healthy, and we will email you when they are ready. Exports queued during the window are accepted by the API gateway, which returns a job ID, and run after the window closes.

Are integrations affected?

Inbound webhooks return 503, so your systems should retry them. Outbound webhooks pause during the work and are resent with exponential backoff once we're back up; we retry endpoints returning 5xx for up to 24 hours.
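For integrators sizing their own retry logic, the following sketch shows one way an exponential backoff schedule can fit inside a 24-hour retry budget. Only the 24-hour limit comes from this page. The base delay, the doubling factor, the one-hour cap and the `backoff_delays` function are assumptions for illustration and may differ from what our webhook sender actually uses.

```python
def backoff_delays(base_s=60.0, factor=2.0, max_delay_s=3600.0,
                   budget_s=24 * 3600.0):
    """Return the waits between retries that fit inside the retry budget.

    Each wait grows by `factor` up to `max_delay_s`; the schedule stops
    when the next wait would push total elapsed time past `budget_s`.
    All defaults except the 24-hour budget are assumed values.
    """
    delays = []
    elapsed = 0.0
    delay = base_s
    while elapsed + delay <= budget_s:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * factor, max_delay_s)
    return delays


schedule = backoff_delays()
print(schedule[:4])  # [60.0, 120.0, 240.0, 480.0]
```

Production senders often add random jitter to each wait so that retries from many endpoints do not arrive at the same instant; that is omitted here to keep the schedule deterministic.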

Contact

On-call · ops east: ops-east@lankworth-tlm.io · paged

On-call · ops west: ops-west@lankworth-tlm.io · backup

Status page: status.lankworth-tlm.io

Customer success: cs@lankworth-tlm.io · best-effort

Site Reliability · Lankworth Telematics · page · sre-pager-east · rev. 7