If you stumble upon this error while troubleshooting a cluster agent that has suddenly disconnected, keep reading: after sitting with this issue for far too long, I finally found and fixed the root cause.
In short, we had a cluster whose cluster agent repeatedly failed to connect to Rancher, no matter whether the node was rebooted, reinstalled, or recreated. The only errors in the cattle-cluster-agent pods were the following:
E1128 22:14:04.027989 39 reflector.go:158] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1.ClusterRepo: failed to list *v1.ClusterRepo: parsing time \"2024-01-01T22:42:00\" as \"2006-01-02T15:04:05Z07:00\": cannot parse \"\" as \"Z07:00\"" logger="UnhandledError"
W1128 22:14:46.175887 39 reflector.go:561] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: failed to list *v1.ClusterRepo: parsing time "2024-01-01T22:42:00" as "2006-01-02T15:04:05Z07:00": cannot parse "" as "Z07:00"
The fix turned out to be simple. Searching for the offending date string:
kubectl get clusterrepo -o yaml | grep "2024-01-01T22:42:00" -A10 -B10
turns it up in one of the ClusterRepo resources:
kind: ClusterRepo
metadata:
  annotations:
    field.cattle.io/description: https://charts.gitlab.io
  creationTimestamp: "2023-02-22T10:53:59Z"
  generation: 2
  name: gitlab
  resourceVersion: "500050112"
  uid: c2ebaa0d-be50-4e8d-ab7b-2f7385148a04
spec:
  forceUpdate: 2024-01-01T22:42:00
  url: https://charts.gitlab.io
status:
  conditions:
  - lastUpdateTime: "2023-02-22T10:53:59Z"
    status: "True"
    type: FollowerDownloaded
  - lastUpdateTime: "2024-09-12T21:23:10Z"
    status: "True"
    type: Downloaded
  downloadTime: "2024-09-12T21:23:09Z"
The field spec.forceUpdate contains an incorrectly formatted date, which causes the cluster agent to panic before it fully starts up. The fix is to append a Z (the UTC zone designator) to the forceUpdate time string:
forceUpdate: 2024-01-01T22:42:00Z
This resolves the format error Rancher is hitting. The bad value may exist in multiple ClusterRepos, so check all of them and fix any incorrect timestamps in this field:
kubectl edit clusterrepo
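If many ClusterRepos are affected, the normalization step itself is easy to script. A sketch in Go (normalizeForceUpdate is a hypothetical helper of mine, not part of Rancher; you would still apply the corrected value with kubectl edit or kubectl patch):

```go
package main

import (
	"fmt"
	"time"
)

// normalizeForceUpdate returns the timestamp unchanged if it is already
// valid RFC 3339, and appends a UTC "Z" when the value parses as a
// zone-less timestamp, which is exactly the broken case described above.
func normalizeForceUpdate(ts string) (string, error) {
	if _, err := time.Parse(time.RFC3339, ts); err == nil {
		return ts, nil // already valid, leave it alone
	}
	if _, err := time.Parse("2006-01-02T15:04:05", ts); err == nil {
		return ts + "Z", nil // zone-less value: assume UTC and append Z
	}
	return "", fmt.Errorf("unrecognized timestamp %q", ts)
}

func main() {
	fixed, err := normalizeForceUpdate("2024-01-01T22:42:00")
	fmt.Println(fixed, err) // 2024-01-01T22:42:00Z <nil>
}
```

Note the assumption baked into the helper: a zone-less timestamp is treated as UTC, which matches how the Z suffix fixes the field here.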
Once done, manually restart one of the cattle-cluster-agent pods, and the agent should once again connect to Rancher.