When ArgoCD shows Healthy but Keycloak silently strips JWT claims

Snapshot live state before any reconciliation. The live API is now the source of truth, not the ConfigMap.

Diffing live-clients.json against the clients block in configmap-realm.json showed six clients with material differences. Two were missing protocol mappers entirely. Three had client scopes that had been removed. One had role mappings that were present in the ConfigMap but missing in production, which told us that client had also been changed in the console at some point and the change had been overwritten on a previous sync we had not even noticed. That last finding was the one that mattered most: this was not the first time the OVERWRITE strategy had quietly destroyed live config. It was just the first time the destruction had cascaded far enough to break downstream services.
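A comparison like this can be sketched with jq. The sketch assumes live-clients.json is a JSON array of client representations (the shape returned by the Admin API's clients listing) and that configmap-realm.json is a realm export with a top-level clients array. The function names and the choice of compared fields (protocol mapper names and default client scopes) are illustrative, not the exact script used during the incident.

```shell
#!/usr/bin/env bash
# Sketch: list clients whose mappers or default scopes differ between the live
# snapshot and the realm JSON in git. Requires jq.

# Reduce a JSON array of clients (stdin) to one compact line per client,
# keeping only the fields being compared, in a stable order.
summarize_clients() {
  jq -c '
    map({
      clientId,
      mappers: ([.protocolMappers[]?.name] | sort),
      scopes:  ((.defaultClientScopes // []) | sort)
    })
    | sort_by(.clientId)
    | .[]'
}

# Print the clientId of every client that differs or exists on one side only.
diff_clients() {
  local live_file=$1 realm_file=$2 tmp
  tmp=$(mktemp -d)
  summarize_clients < "$live_file" > "$tmp/live"
  jq '.clients' "$realm_file" | summarize_clients > "$tmp/git"
  # diff exits 1 when the inputs differ; that is the expected case here.
  diff "$tmp/live" "$tmp/git" | sed -n 's/^[<>] //p' | jq -r '.clientId' | sort -u
  rm -rf "$tmp"
}
```

Calling `diff_clients live-clients.json configmap-realm.json` prints one clientId per line. Role mappings and other fields would need their own entries in the summary object.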

Two write paths to the same realm. OVERWRITE makes one of them silently win.

Reconstructing realm state without invalidating active sessions

Why we did not re-import the ConfigMap

The obvious recovery path was to fix the realm JSON in git, commit it, and let ArgoCD re-sync. We did not do that, and the reason matters. A full realm re-import, even with the right content, runs through the Keycloak realm import flow on startup. Depending on the chart and the Keycloak version, that can rotate signing keys, drop active sessions, or invalidate refresh tokens. We had roughly 8,000 active user sessions at that moment. Forcing all of them to re-authenticate at 11pm during an active incident was not a recovery; it was a second outage on top of the first.

So we split the fix into two phases. Phase one was to restore live realm state using the Admin REST API directly, client by client, mapper by mapper. The REST API can add a protocol mapper or attach a client scope to a client without bouncing anything. Phase two was to update the ConfigMap in git to match the now-correct live state AND change the import strategy, so that the next ArgoCD sync would be a no-op rather than another OVERWRITE pass.

# Phase 1: restore each missing mapper live via the Admin REST API
# Example: re-add the groups protocol mapper to the auth-service client
# CLIENT_ID is Keycloak's internal UUID for the client, not the clientId string
CLIENT_ID=$(jq -r '.[] | select(.clientId=="auth-service") | .id' live-clients.json)
curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "$KC/admin/realms/primary/clients/$CLIENT_ID/protocol-mappers/models" \
  -d '{
    "name": "groups",
    "protocol": "openid-connect",
    "protocolMapper": "oidc-group-membership-mapper",
    "config": {
      "claim.name": "groups",
      "full.path": "false",
      "id.token.claim": "true",
      "access.token.claim": "true",
      "userinfo.token.claim": "true"
    }
  }'
# Verify a freshly issued token now carries the claim before moving on
PAYLOAD=$(curl -s -X POST "$KC/realms/primary/protocol/openid-connect/token" \
  -d 'grant_type=client_credentials' \
  -d "client_id=auth-service" -d "client_secret=$SECRET" \
  | jq -r .access_token | cut -d. -f2 | tr '_-' '/+')
# JWT segments are unpadded base64url: restore the padding before decoding
while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="$PAYLOAD="; done
printf '%s' "$PAYLOAD" | base64 -d | jq .

Restore each mapper live, then verify the issued token actually carries the claim before moving to the next client.
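The other operation mentioned above, attaching a client scope to a client, follows the same pattern. The sketch below assumes the same $KC and $TOKEN variables and the primary realm from the example. The function names are hypothetical, and it attaches the scope as a default scope; an optional scope would use the optional-client-scopes path instead.

```shell
#!/usr/bin/env bash
# Sketch: attach an existing client scope to a client as a default scope.
# Requires curl and jq; KC and TOKEN are expected in the environment.

# Build the Admin REST URL for one default client scope on one client.
# $1 = base URL, $2 = realm, $3 = client UUID, $4 = client scope UUID
default_scope_url() {
  printf '%s/admin/realms/%s/clients/%s/default-client-scopes/%s' "$1" "$2" "$3" "$4"
}

# $1 = client UUID, $2 = client scope name (for example "roles")
attach_default_scope() {
  local client_uuid=$1 scope_name=$2 scope_id
  # Resolve the scope name to Keycloak's internal id.
  scope_id=$(curl -sf -H "Authorization: Bearer $TOKEN" \
      "$KC/admin/realms/primary/client-scopes" \
    | jq -r --arg n "$scope_name" '.[] | select(.name == $n) | .id')
  [ -n "$scope_id" ] || { echo "scope not found: $scope_name" >&2; return 1; }
  # PUT with no body links the scope to the client.
  curl -sf -X PUT -H "Authorization: Bearer $TOKEN" \
    "$(default_scope_url "$KC" primary "$client_uuid" "$scope_id")"
}
```

For example, `attach_default_scope "$CLIENT_ID" roles` would re-attach a scope named roles, assuming one exists in the realm.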

We worked through the six clients in dependency order: auth-service first because every other service consumed its tokens, then the api gateway, then profile, then the rest. After each client we curl'd a fresh token and base64-decoded the payload to confirm the claim was present. Twenty-two minutes from the start of restoration, timeline-service was returning 200s again. No sessions dropped. No users re-authenticated. The Keycloak pods were never restarted.
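The per-client check can be turned into a small helper that fails when a claim is absent, so it is harder to skim past a missing claim at 11pm. This is a sketch rather than the script we ran: jwt_payload and has_claim are illustrative names, and the check only looks for a top-level key in the payload.

```shell
#!/usr/bin/env bash
# Sketch: assert that a JWT's payload contains a given top-level claim.
# Requires jq and a base64 that supports -d.

# Decode the payload (second dot-separated segment) of a JWT to stdout.
jwt_payload() {
  local seg
  # Map the base64url alphabet back to standard base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # JWT segments are unpadded; restore '=' padding to a multiple of four.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="$seg="; done
  printf '%s' "$seg" | base64 -d
}

# Exit 0 if the claim is present, non-zero otherwise.
# $1 = token, $2 = claim name
has_claim() {
  jwt_payload "$1" | jq -e --arg c "$2" 'has($c)' > /dev/null
}
```

In a restoration loop this might be used as `has_claim "$ACCESS_TOKEN" groups || echo "groups still missing" >&2`, where ACCESS_TOKEN holds a freshly issued token.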

What we changed so the next sync becomes a no-op

The one Helm value that should never be OVERWRITE

With live state correct, the dangerous artifact in the system was still the stale realm JSON in the ConfigMap and the OVERWRITE strategy that would re-apply it on any future sync. We exported the now-correct realm via the Admin API, ran it through a diff against what was in git, and committed the result. We also patched the Keycloak Helm values to set the realm import strategy to IGNORE_EXISTING.

# values.yaml for the Keycloak chart
extraEnv: |
  - name: KEYCLOAK_IMPORT_STRATEGY
    value: IGNORE_EXISTING
  # On Keycloak 22+ via Quarkus distribution:
  - name: KC_SPI_IMPORT_SINGLE_FILE_STRATEGY
    value: IGNORE_EXISTING
# For the operator/CR variant:
# spec:
#   realmImport:
#     strategy: IGNORE_EXISTING # NOT OVERWRITE_EXISTING

IGNORE_EXISTING means the ConfigMap seeds a realm on first creation but never overwrites existing resources. This is the correct setting for any realm that humans also edit.
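One way to keep this setting from regressing is a CI check that rejects any values file containing an OVERWRITE strategy. The sketch below is an assumption about how such a guard could look, not something from our pipeline: check_no_overwrite is a hypothetical name, and the comment stripping is deliberately crude (it drops everything after a # on each line).

```shell
#!/usr/bin/env bash
# Sketch: fail when any given file sets OVERWRITE or OVERWRITE_EXISTING
# outside of a comment.

check_no_overwrite() {
  local status=0 f
  for f in "$@"; do
    # Strip comments, then match OVERWRITE as a standalone token so that
    # IGNORE_EXISTING and unrelated identifiers do not trigger the check.
    if sed 's/#.*//' "$f" \
        | grep -Eq '(^|[^A-Z_])OVERWRITE(_EXISTING)?([^A-Z_]|$)'; then
      echo "forbidden import strategy in $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

Run as `check_no_overwrite values.yaml` in a pre-merge job; a non-zero exit blocks the change.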

We re-enabled ArgoCD auto-sync and watched it run. The sync diffed clean: ConfigMap content matched live realm, import strategy was IGNORE_EXISTING, no resources were touched. Green for the right reason this time.

We changed two things in the way the team operates going forward. First, we wrote a small drift detector that runs nightly. It pulls the live realm via the Admin API, diffs it against the realm JSON in git, and posts to a Slack channel if they disagree. It is roughly 80 lines and it has caught two console-edits-not-committed in the six weeks since. Second, we now treat OVERWRITE as a forbidden value for any realm that is also editable in the admin console. If you want OVERWRITE semantics, you must also remove admin console write access for everyone except a break-glass account, because otherwise you are building a system where one of two writers silently destroys the other's work. We have written more about this category of GitOps failure in the ArgoCD and GitOps recovery cluster, and the same pattern shows up with Grafana dashboards, Argo Workflows templates, and anything else where humans and a controller both have write access to the same object.
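A stripped-down version of the drift check might look like the following. It is a sketch under several assumptions: the live realm is fetched with the Admin API's partial-export endpoint, the realm JSON in git was itself produced from such an export (so the two are directly comparable), and SLACK_WEBHOOK_URL is a hypothetical environment variable holding a Slack incoming webhook. Partial exports mask client secrets, so secrets are not compared. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: nightly realm drift check. Requires curl and jq.
# KC, TOKEN and SLACK_WEBHOOK_URL are expected in the environment.

# Compare two JSON files after sorting object keys.
# Prints a unified diff and returns 1 on drift, returns 0 when they agree.
# Array order is compared as-is.
# $1 = live realm JSON, $2 = realm JSON from git
realm_drift() {
  local live=$1 git=$2 tmp rc=0
  tmp=$(mktemp -d)
  jq -S . "$live" > "$tmp/live.json"
  jq -S . "$git"  > "$tmp/git.json"
  diff -u "$tmp/git.json" "$tmp/live.json" || rc=$?
  rm -rf "$tmp"
  return "$rc"
}

# $1 = path to the realm JSON checked out from git
nightly_check() {
  local live report
  live=$(mktemp)
  curl -sf -X POST -H "Authorization: Bearer $TOKEN" \
    "$KC/admin/realms/primary/partial-export?exportClients=true&exportGroupsAndRoles=true" \
    > "$live" || { rm -f "$live"; return 2; }
  if ! report=$(realm_drift "$live" "$1"); then
    # Slack incoming webhooks accept a JSON body with a "text" field.
    jq -n --arg t "Keycloak realm drift detected:
$report" '{text: $t}' \
      | curl -sf -X POST -H 'Content-Type: application/json' -d @- "$SLACK_WEBHOOK_URL"
  fi
  rm -f "$live"
}
```

Scheduled from cron or a Kubernetes CronJob, `nightly_check configmap-realm.json` stays silent when git and the live realm agree and posts the diff when they do not.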

When GitOps is silently rewriting your identity provider

If your realm config and your cluster disagree

The hard part of this kind of incident is not the Keycloak knowledge. It is recognizing that a green ArgoCD dashboard can coexist with a destroyed production configuration, and knowing which fixes preserve sessions versus which ones lock out every user in the building at midnight. The team we worked with had the Keycloak skills. What they did not have was a recovery sequence that prioritized live state capture over git reconciliation, and a clear rule about when to apply via the Admin API versus when to let ArgoCD do it.

We run these recovery engagements every week. The OVERWRITE-vs-IGNORE_EXISTING trap has hit two other teams this quarter, both on Keycloak, and we have seen the same shape on Grafana provisioning, Argo Workflows ClusterWorkflowTemplates, and a memorable case with Vault policies. The pattern is always: controller writes, human writes, controller wins on the next reconcile, nobody notices for hours.

If your identity provider, your dashboards, or any other system with human-editable state is sitting behind ArgoCD and you have ever wondered whether you are quietly losing changes, book an infrastructure review with our team and we will be on a bridge with you the same day. The first 30 minutes will tell you whether you have a drift problem, and from there we can scope a recovery that does not require kicking your users out.

Originally published at https://infraforge.agency/insights/keycloak-realm-overwrite-argocd-sync-drift/.
