Longhorn Maintenance Window
This guide describes a repeatable procedure to stop workloads in a controlled way for Longhorn upgrades and then restore normal operations.
Prerequisites
- kubectl with access to the cluster
- yq to read the maintenance plan
- Up-to-date checkout of this repository (git pull) so that scripts/scale-maintenance-plan.yaml and scripts/scale-maintenance.sh are available
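The prerequisites can be checked before the window starts. The following is a minimal sketch, not part of the repository's scripts; the helper names missing_tools and missing_files are hypothetical, and it assumes the commands are run from the repository root.

```shell
# Sketch only: helper names are illustrative, not taken from scale-maintenance.sh.

# Print every command from the argument list that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Print every path from the argument list that is not a regular file.
missing_files() {
  local path
  for path in "$@"; do
    [ -f "$path" ] || echo "$path"
  done
}
```

For example, `missing_tools kubectl yq` followed by `missing_files scripts/scale-maintenance-plan.yaml scripts/scale-maintenance.sh` should both print nothing when everything is in place.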
Note: The affected Argo CD applications ignore replica drift thanks to their spec.ignoreDifferences entries. During the maintenance window they will report as "Degraded" but will not attempt to roll replicas back automatically.
Procedure
- Check status
  Shows the currently configured replicas and any stored original value for each workload.
- Stop workloads
  - Stores the current replica count as the annotation maintenance.k3s-git-ops/original-replicas.
  - Scales the Deployments, StatefulSets, and operator-managed CRs (Prometheus / Alertmanager) listed in scripts/scale-maintenance-plan.yaml down to 0.
  - The command is idempotent and can be dry-run with --dry-run.
- Perform the Longhorn upgrade
  - Upgrade Longhorn (UI or Argo CD).
  - Afterwards verify that the storage cluster is healthy.
- Restore workloads
  - Reads the stored values and restores the replicas.
  - Removes the maintenance annotation afterwards.
- Post checks
  - Verifies that no maintenance annotations remain (empty STORED column).
  - Trigger a "Refresh" in Argo CD for affected apps to update their health status.
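To make the stop and restore steps concrete, here is a rough manual equivalent for a single Deployment or StatefulSet. It is a sketch under stated assumptions, not the contents of scripts/scale-maintenance.sh: the function names are hypothetical, and it relies on kubectl scale, which is why it is limited to Deployments and StatefulSets here (the operator-managed CRs would need the merge patch on spec.replicas mentioned under Troubleshooting). Only the annotation key is taken from this guide.

```shell
# Sketch only: a manual equivalent for one workload, not the real script.
ANNOTATION="maintenance.k3s-git-ops/original-replicas"
# In kubectl jsonpath, dots inside an annotation key must be escaped.
ANNOTATION_PATH='{.metadata.annotations.maintenance\.k3s-git-ops/original-replicas}'

stop_workload() {  # usage: stop_workload <namespace> <kind> <name>
  local ns="$1" kind="$2" name="$3" stored current
  stored="$(kubectl -n "$ns" get "$kind" "$name" -o jsonpath="$ANNOTATION_PATH")"
  if [ -z "$stored" ]; then
    # First run: remember the current replica count before scaling down.
    # Skipping this on later runs keeps the step idempotent (no overwrite with 0).
    current="$(kubectl -n "$ns" get "$kind" "$name" -o jsonpath='{.spec.replicas}')"
    kubectl -n "$ns" annotate "$kind" "$name" "${ANNOTATION}=${current}" --overwrite
  fi
  kubectl -n "$ns" scale "$kind" "$name" --replicas=0
}

restore_workload() {  # usage: restore_workload <namespace> <kind> <name>
  local ns="$1" kind="$2" name="$3" stored
  stored="$(kubectl -n "$ns" get "$kind" "$name" -o jsonpath="$ANNOTATION_PATH")"
  if [ -z "$stored" ]; then
    echo "no stored replica count for $kind/$name in $ns" >&2
    return 1
  fi
  kubectl -n "$ns" scale "$kind" "$name" --replicas="$stored"
  # A trailing dash after the key removes the annotation.
  kubectl -n "$ns" annotate "$kind" "$name" "${ANNOTATION}-"
}
```

The annotation is written only when none exists yet, so a second stop run cannot replace the stored value with 0; that is one plausible way to get the idempotence described above.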
Adjustments
- Add additional workloads to scripts/scale-maintenance-plan.yaml (id, namespace, kind, name).
- Alternate plans (for other clusters, for example) can be provided via PLAN_FILE=/path/to/plan.yaml or the --plan-file flag.
- Use --dry-run for a rehearsal without actually touching replicas; the commands will be logged only.
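The plan-file and dry-run options can be pictured with a small sketch. This is an illustration of the general pattern, not the script's actual code: the function names, the DRY_RUN variable, and the precedence (flag over PLAN_FILE over the default path) are all assumptions.

```shell
# Sketch only: names and precedence are assumptions, not taken from the script.
DEFAULT_PLAN="scripts/scale-maintenance-plan.yaml"

# Assumed precedence: --plan-file flag, then PLAN_FILE, then the default path.
resolve_plan_file() {
  local plan="${PLAN_FILE:-$DEFAULT_PLAN}"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --plan-file)
        [ "$#" -ge 2 ] || { echo "--plan-file needs a path" >&2; return 2; }
        plan="$2"
        shift 2 ;;
      --plan-file=*)
        plan="${1#--plan-file=}"
        shift ;;
      *)
        shift ;;
    esac
  done
  printf '%s\n' "$plan"
}

# Log the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}
```

Routing every mutating command through a wrapper like run is what makes a rehearsal safe: with DRY_RUN=1 the command line is printed and nothing reaches the cluster.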
Troubleshooting
- Annotation missing on restore: Confirm that the scale-down completed successfully; otherwise set the desired replica count manually (for Prometheus / Alertmanager CRs: kubectl patch ... --type merge -p '{"spec":{"replicas":<n>}}') and re-run the script.
- Resource not found: Check namespace and name in the plan. Helm releases with prefixes must match exactly.
- Argo CD immediately restores replicas: Confirm that the application includes the ignoreDifferences entry and that it has been refreshed after the maintenance window.
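For the manual fallback in the first item, the merge patch can be wrapped in a small helper so the JSON payload is not typed by hand. This is a sketch: the function names are hypothetical, and the namespace, resource type, and object name must be supplied by the operator because this guide does not state them.

```shell
# Sketch only: helper names are illustrative.

# Build the merge patch body for a given replica count.
replicas_patch() {  # usage: replicas_patch <count>
  case "$1" in
    ''|*[!0-9]*)
      echo "replica count must be a non-negative integer" >&2
      return 2 ;;
  esac
  printf '{"spec":{"replicas":%s}}' "$1"
}

# Apply the patch to one object.
set_replicas() {  # usage: set_replicas <namespace> <resource> <name> <count>
  local patch
  patch="$(replicas_patch "$4")" || return 2
  kubectl -n "$1" patch "$2" "$3" --type merge -p "$patch"
}
```

A call would look like `set_replicas <namespace> prometheus <name> <n>`, with the placeholders replaced by the values from the plan entry of the affected workload.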