
Upgrading Team Operator

This guide covers upgrading the Team Operator: pre-upgrade preparation, upgrade procedures, version-specific migrations, and troubleshooting.

CRD Management (v1.15+)

Starting with v1.15.0, the operator automatically applies its own CRDs at startup using server-side apply. This ensures the CRD schema always matches the running operator binary, even when only the container image is updated without a full Helm chart upgrade (e.g., ad hoc images for testing).

The operator uses the --manage-crds flag (default: true) to control this behavior. To opt out (for example, if you manage CRDs via Flux or ArgoCD), set:

controllerManager:
  container:
    args:
      - "--manage-crds=false"

When --manage-crds=false, the operator starts without touching CRDs, and you are responsible for keeping them in sync with the operator version.

Benefits of automatic CRD management:

  • CRDs are always in sync with the operator version
  • Works with ad hoc images (e.g., PR branches) without requiring Helm chart changes
  • Uses server-side apply (SSA) which is idempotent and only updates when schema differs
  • No manual CRD management needed for most deployments
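
Idempotence here means that applying the same desired state a second time is a no-op. As a toy illustration of that property (this is not the real server-side apply algorithm, which also tracks per-field ownership):

```python
def apply(current: dict, desired: dict) -> dict:
    """Toy 'apply': recursively overlay desired fields onto current state."""
    result = dict(current)
    for key, value in desired.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply(result[key], value)
        else:
            result[key] = value
    return result

crd = {"spec": {"group": "core.posit.team", "scope": "Namespaced"}}
patch = {"spec": {"scope": "Cluster"}}

once = apply(crd, patch)
twice = apply(once, patch)
assert once == twice  # re-applying the same patch changes nothing
```

Because re-applies converge, the operator can unconditionally apply its CRDs at every startup without churning the objects.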

When to disable:

  • GitOps workflows (Flux, ArgoCD) that manage CRDs separately
  • Security policies requiring explicit CRD review before application
  • Multi-tenant clusters where CRD updates require approval

RBAC Permissions: The operator requires the following RBAC permissions on its own CRDs:

  • get - to check if CRDs exist
  • patch - to apply schema updates via server-side apply
  • update - to modify CRD metadata

The Helm chart automatically grants these permissions. The operator intentionally omits the delete verb to prevent accidental data loss.
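
For reference, a minimal sketch of the corresponding ClusterRole rule (the name and exact shape here are assumptions; the Helm chart's generated RBAC is authoritative and may additionally restrict resourceNames to the operator's own CRDs):

```yaml
# Sketch of the RBAC rule granted to the operator for CRD management.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: team-operator-crd-manager  # illustrative name
rules:
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["get", "patch", "update"]  # "delete" intentionally omitted
```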

Note on CRD deletion: Because the operator's RBAC omits the delete verb for CRDs, if a future operator version removes a resource type, the now-orphaned CRD will remain in the cluster and must be removed manually:

kubectl delete crd <crd-name>.core.posit.team

Before deleting an orphaned CRD, ensure all custom resources of that type have been removed to avoid losing data:

kubectl get <resource-plural> -A  # verify no instances remain
kubectl delete crd <crd-name>.core.posit.team

Before Upgrading

Backup Procedures

Before performing any upgrade, create backups of critical resources:

1. Backup Custom Resources

# Backup all Site resources
kubectl get sites -A -o yaml > sites-backup.yaml

# Backup all product resources
kubectl get workbenches -A -o yaml > workbenches-backup.yaml
kubectl get connects -A -o yaml > connects-backup.yaml
kubectl get packagemanagers -A -o yaml > packagemanagers-backup.yaml
kubectl get chronicles -A -o yaml > chronicles-backup.yaml
kubectl get flightdecks -A -o yaml > flightdecks-backup.yaml
kubectl get postgresdatabases -A -o yaml > postgresdatabases-backup.yaml

# Backup all Posit Team resources at once
kubectl get sites,workbenches,connects,packagemanagers,chronicles,flightdecks,postgresdatabases -A -o yaml > posit-team-resources-backup.yaml

2. Backup Secrets

# Backup secrets in the Posit Team namespace
kubectl get secrets -n posit-team -o yaml > secrets-backup.yaml

# For sensitive backups, consider encrypting (symmetric; you will be prompted for a passphrase)
kubectl get secrets -n posit-team -o yaml | gpg -c > secrets-backup.yaml.gpg

# To restore later, decrypt first:
# gpg -d secrets-backup.yaml.gpg > secrets-backup.yaml

3. Backup Databases

If products (Connect, Workbench, Package Manager) use external databases, back them up before upgrading. The operator manages PostgresDatabase resources, and an upgrade may include schema changes that affect them.

# List managed databases
kubectl get postgresdatabases -A

# For each database, create a backup using your database backup procedures
# Example for PostgreSQL:
# pg_dump -h <host> -U <user> -d <database> > database-backup.sql

Check Current Version

Verify your current installation:

# Check Helm release version
helm list -n posit-team-system

# Check operator deployment image
kubectl get deployment team-operator-controller-manager -n posit-team-system -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check CRD versions
kubectl get crds | grep posit.team

Review Changelog

Review the CHANGELOG.md for breaking changes between your current version and the target version. Look for:

  • Breaking changes that require configuration updates
  • Deprecated fields that need migration
  • New required fields

Test in Non-Production

Critical: Test upgrades in a non-production environment first:

  1. Create a staging cluster or namespace that mirrors production
  2. Apply the same Site configuration
  3. Perform the upgrade
  4. Verify all products function
  5. Test any automated integrations

Upgrade Methods

Helm Upgrade Procedure

The recommended upgrade method is Helm:

Standard Upgrade

# Update Helm repository (if using external repo)
helm repo update

# Preview changes before applying (requires the helm-diff plugin)
helm diff upgrade team-operator ./dist/chart \
  --namespace posit-team-system \
  --values my-values.yaml

# Perform the upgrade
helm upgrade team-operator ./dist/chart \
  --namespace posit-team-system \
  --values my-values.yaml

Upgrade with Specific Version

helm upgrade team-operator ./dist/chart \
  --namespace posit-team-system \
  --set controllerManager.container.image.tag=v1.2.0 \
  --values my-values.yaml

Upgrade with CRD Updates

CRDs are updated during Helm upgrade when crd.enable: true (default). If you've disabled CRD management:

# Manually apply CRD updates first
kubectl apply -f dist/chart/templates/crd/

# Then upgrade the operator
helm upgrade team-operator ./dist/chart \
  --namespace posit-team-system \
  --values my-values.yaml

Kustomize Upgrade Procedure

If using Kustomize for deployment:

# Update the kustomization.yaml to reference the new version
# Then apply:
kubectl apply -k config/default

# Or for specific overlays:
kubectl apply -k config/overlays/production

CRD Upgrade Considerations

CRDs require attention during upgrades:

  1. CRDs Persist Across Helm Uninstall: By default (crd.keep: true), CRDs remain in the cluster after helm uninstall. This prevents accidental data loss but requires careful CRD management.

  2. CRD Version Compatibility: The operator manages CRDs at API version core.posit.team/v1beta1 (and keycloak.k8s.keycloak.org/v2alpha1 for Keycloak). Your CRs must be compatible with the CRD schema in the new version.

  3. Schema Validation: After CRD updates, existing CRs are validated against the new schema. Invalid CRs may prevent reconciliation.

# Verify CRDs are updated
kubectl get crds sites.core.posit.team -o jsonpath='{.metadata.resourceVersion}'

# Check for validation issues
kubectl get sites -A -o json | jq '.items[] | select(.status.conditions[]?.reason == "InvalidSpec")'

Version-Specific Migrations

v1.15.0

Breaking Change: Database Password Secret Rename

The Kubernetes Secret used to store the database password for each product component has been renamed from <component-name> to <component-name>-db-password.

If you are upgrading an existing installation that has already run the operator against live clusters, you must migrate the existing secrets before upgrading. Otherwise, the operator will create new secrets at the new name with freshly generated passwords, leaving the old secrets orphaned and causing database authentication failures.

Migration steps (run before upgrading the operator):

  1. Identify the components with existing DB password secrets:

    for comp in workbench connect packagemanager; do
      kubectl get secret "${comp}" -n posit-team --ignore-not-found -o name
    done
  2. For each component (workbench, connect, packagemanager), rename the secret:

    Warning: If ${NEW_NAME} already exists in the cluster, do not apply this migration — the operator has already generated a new password and you must re-synchronize the database password manually.

    # Get the old secret data
    OLD_NAME=<component-name>
    NEW_NAME="${OLD_NAME}-db-password"
    NAMESPACE=posit-team
    
    # Create new secret with old data
    kubectl get secret "${OLD_NAME}" -n "${NAMESPACE}" -o json \
      | python3 -c "import json,sys; d=json.load(sys.stdin); d['metadata']['name']='${NEW_NAME}'; [d['metadata'].pop(k,None) for k in ['resourceVersion','uid','creationTimestamp','managedFields','ownerReferences']]; print(json.dumps(d))" \
      | kubectl apply -f -
    
    # Delete old secret
    kubectl delete secret "${OLD_NAME}" -n "${NAMESPACE}"
  3. Proceed with the operator upgrade.

If you are performing a fresh installation or upgrading a cluster that has never had the operator running against it, no migration is needed.
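
The inline python3 transform in step 2 can also be kept as a small standalone script, which is easier to review. It performs the same rename and strips the server-managed metadata fields so the object can be cleanly re-applied:

```python
import json

# Fields set by the API server that must not be carried into the new object.
SERVER_MANAGED = ["resourceVersion", "uid", "creationTimestamp",
                  "managedFields", "ownerReferences"]

def rename_secret(secret: dict, new_name: str) -> dict:
    """Return a copy of a Secret manifest, renamed and cleaned for re-apply."""
    out = json.loads(json.dumps(secret))  # deep copy; input stays untouched
    out["metadata"]["name"] = new_name
    for field in SERVER_MANAGED:
        out["metadata"].pop(field, None)
    return out

# Demo with a minimal Secret manifest; in practice, feed it the JSON from
# `kubectl get secret <old-name> -n posit-team -o json` and pipe the printed
# result to `kubectl apply -f -`.
old = {"metadata": {"name": "connect", "uid": "abc", "resourceVersion": "42"},
       "type": "Opaque", "data": {"password": "c2VjcmV0"}}
print(json.dumps(rename_secret(old, "connect-db-password")))
```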

v1.2.0

New Features:

  • Added CreateOrUpdateResource helper for improved reconciliation
  • Post-mutation label validation for Traefik resources

Deprecations:

  • BasicCreateOrUpdate function is deprecated in favor of CreateOrUpdateResource

No configuration changes required for users.

v1.1.0

New Features:

  • Added tolerations and nodeSelector support for controller manager

Migration: If you used workarounds for pod scheduling, update your values:

controllerManager:
  tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
  nodeSelector:
    kubernetes.io/os: linux

v1.0.4

Bug Fixes:

  • Removed kustomize-adopt hook that could fail on tainted clusters

No migration required.

v1.0.0

Initial Release:

  • Migration from rstudio/ptd repository

If upgrading from the legacy rstudio/ptd operator, contact Posit support for migration assistance.

Known Deprecated Fields

The following fields are deprecated and will be removed in future versions:

| CRD | Field | Replacement | Notes |
| --- | --- | --- | --- |
| Site | spec.secretType | spec.secret.type | Use the new Secret configuration block |
| Workbench | spec.config.databricks.conf | spec.secretConfig.databricks | Databricks config moved to SecretConfig |
| PackageManager | spec.config.CRAN | N/A | PackageManagerCRANConfig is deprecated |

Migration Example - Databricks Configuration:

Before (deprecated):

apiVersion: core.posit.team/v1beta1
kind: Workbench
spec:
  config:
    databricks.conf:
      workspace1:
        name: "My Workspace"
        url: "https://workspace.cloud.databricks.com"

After (recommended):

apiVersion: core.posit.team/v1beta1
kind: Site
spec:
  workbench:
    databricks:
      workspace1:
        name: "My Workspace"
        url: "https://workspace.cloud.databricks.com"
        clientId: "<client-id>"

Key Migration

The operator migrates legacy UUID-format and binary-format encryption keys to the new hex256 format automatically during reconciliation. Monitor logs for migration messages:

kubectl logs -n posit-team-system deployment/team-operator-controller-manager | grep -i "migrating"
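
As a rough illustration of the formats involved, a key's format could be classified as below. Note the assumptions: "hex256" is taken to mean a 256-bit key encoded as 64 hexadecimal characters, and the labels are hypothetical, not the operator's actual internals.

```python
import re
import string

# RFC 4122 style UUID: 8-4-4-4-12 hex groups.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)

def key_format(key):
    """Classify a key as 'uuid', 'hex256', or 'binary' (hypothetical labels)."""
    if isinstance(key, bytes):
        try:
            key = key.decode("ascii")
        except UnicodeDecodeError:
            return "binary"  # non-ASCII bytes: raw binary key
    if UUID_RE.match(key):
        return "uuid"
    if len(key) == 64 and all(c in string.hexdigits for c in key):
        return "hex256"  # assumption: 256 bits as 64 hex characters
    return "binary"
```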

Post-Upgrade Verification

1. Check Operator Health

# Verify the operator pod is running
kubectl get pods -n posit-team-system -l control-plane=controller-manager

# Check operator logs for errors
kubectl logs -n posit-team-system deployment/team-operator-controller-manager --tail=100

# Verify health endpoints
kubectl exec -n posit-team-system deployment/team-operator-controller-manager -- wget -qO- http://localhost:8081/healthz
kubectl exec -n posit-team-system deployment/team-operator-controller-manager -- wget -qO- http://localhost:8081/readyz

2. Verify CRD Versions

# List all Posit Team CRDs with versions
kubectl get crds -o custom-columns=NAME:.metadata.name,VERSION:.spec.versions[0].name | grep posit.team

# Expected output:
# chronicles.core.posit.team        v1beta1
# connects.core.posit.team          v1beta1
# flightdecks.core.posit.team       v1beta1
# packagemanagers.core.posit.team   v1beta1
# postgresdatabases.core.posit.team v1beta1
# sites.core.posit.team             v1beta1
# workbenches.core.posit.team       v1beta1

3. Test Product Functionality

# Check all Sites are reconciling
kubectl get sites -A

# Check individual product resources
kubectl get workbenches -A
kubectl get connects -A
kubectl get packagemanagers -A

# Verify deployments are healthy
kubectl get deployments -n posit-team

# Test product endpoints
curl -I https://workbench.<your-domain>
curl -I https://connect.<your-domain>
curl -I https://packagemanager.<your-domain>

4. Monitor for Issues

Watch operator logs for the first 15-30 minutes after upgrade:

kubectl logs -n posit-team-system deployment/team-operator-controller-manager -f

Look for:

  • Reconciliation errors
  • CRD validation failures
  • Database connection issues
  • Certificate/TLS errors

Rollback Procedures

Helm Rollback

If issues occur after upgrade, rollback to the previous release:

# List release history
helm history team-operator -n posit-team-system

# Rollback to previous revision
helm rollback team-operator <revision-number> -n posit-team-system

# Example: rollback to revision 2
helm rollback team-operator 2 -n posit-team-system

CRD Considerations During Rollback

Important: CRDs are not rolled back by helm rollback because they carry the helm.sh/resource-policy: keep annotation (crd.keep: true). If the new CRDs added fields, older operator versions may still work but won't recognize those fields.

If CRD rollback is necessary:

# Save current CRs
kubectl get sites,workbenches,connects,packagemanagers -A -o yaml > pre-rollback-backup.yaml

# Apply old CRDs (from your backup or previous chart version)
kubectl apply -f old-crds/

# Verify CRs are still valid
kubectl get sites -A

Data Implications

Consider these data implications during rollback:

  1. Database Schema Changes: If the upgrade included database schema changes, rollback may require database schema rollback as well.

  2. Secret Format Changes: The operator's automatic key migration is one-way. Rolled-back operators will still work with migrated keys.

  3. Configuration Changes: CRs modified to use new fields will need manual cleanup if rolling back to a version that doesn't support those fields.
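
For the third point, one approach is to strip the new fields from exported CR manifests before re-applying them under the older schema. A minimal sketch (the field path below is a placeholder; consult the target version's schema for which fields to remove):

```python
import copy

def strip_field(obj: dict, path: str) -> dict:
    """Remove a dotted field path (e.g. 'spec.workbench.databricks') from a manifest copy."""
    out = copy.deepcopy(obj)
    *parents, leaf = path.split(".")
    node = out
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return out  # path absent; nothing to strip
    node.pop(leaf, None)
    return out
```

In practice you would pipe `kubectl get <resource> -o json` through a script using this helper, review the diff, then apply the cleaned manifest after the rollback.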

Zero-Downtime Upgrades

Best Practices for Production Upgrades

  1. Use Maintenance Windows: Schedule upgrades during low-traffic periods.

  2. Rolling Update Strategy: The operator uses a single replica by default. During operator restarts:

    • Products continue running if the operator is briefly unavailable
    • No reconciliation occurs during operator restart (typically < 30 seconds)
  3. Staged Rollout:

    # First, upgrade operator in staging
    helm upgrade team-operator ./dist/chart -n posit-team-system-staging
    
    # Verify staging works
    # Then upgrade production
    helm upgrade team-operator ./dist/chart -n posit-team-system
  4. Health Check:

    • Liveness probe: /healthz (port 8081)
    • Readiness probe: /readyz (port 8081)
    • These probes ensure the operator is ready before receiving reconciliation requests
  5. Leader Election: If running multiple operator replicas (uncommon), leader election ensures one active reconciler:

    controllerManager:
      container:
        args:
          - "--leader-elect"

Product Availability During Upgrades

  • Workbench: Sessions continue running; new sessions may be delayed
  • Connect: Published content remains accessible
  • Package Manager: Package downloads continue working
  • Flightdeck: Landing page remains accessible

Only reconciliation (applying changes) is affected during operator restart.

Troubleshooting Upgrades

Common Upgrade Issues

CRD Validation Failures

Symptom: CRs fail validation after CRD update

# Check for invalid CRs
kubectl get sites -A 2>&1 | grep -i error

# View validation errors
kubectl describe site <site-name> -n <namespace>

Solution: Update CRs to match new schema requirements or remove deprecated fields.

Webhook Issues

Symptom: Admission webhook errors after upgrade

# Check webhook configuration
kubectl get validatingwebhookconfigurations | grep posit
kubectl get mutatingwebhookconfigurations | grep posit

# If webhooks are causing issues and you need to disable temporarily
kubectl delete validatingwebhookconfigurations <webhook-name>

Solution: Ensure cert-manager is properly configured if webhooks are enabled.

Operator Pod CrashLoopBackOff

Symptom: Operator pod fails to start

# Check pod events
kubectl describe pod -n posit-team-system -l control-plane=controller-manager

# Check logs
kubectl logs -n posit-team-system -l control-plane=controller-manager --previous

Common Causes:

  • Missing RBAC permissions for new resources
  • Invalid environment variables
  • Certificate issues

Solution: Check Helm values and ensure all required permissions are granted.

Reconciliation Loops

Symptom: Operator continuously reconciles resources without reaching a stable state

# Watch operator logs for repeated reconciliation
kubectl logs -n posit-team-system deployment/team-operator-controller-manager -f | grep "Reconciling"

Solution: Check for label/annotation conflicts or resources being modified by multiple controllers.
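
One quick way to spot a loop is to count reconcile log lines per resource over a window; a resource reconciled dozens of times per minute with no spec changes is suspect. A minimal sketch — the log line shape here is an assumption, so adjust the regex to your operator's actual log format:

```python
import re
from collections import Counter

# Hypothetical log line shape: ... Reconciling ... resource="namespace/name"
RESOURCE_RE = re.compile(r'Reconciling.*resource="([^"]+)"')

def hot_resources(log_lines, threshold=10):
    """Return resources whose reconcile count meets or exceeds the threshold."""
    counts = Counter()
    for line in log_lines:
        match = RESOURCE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {res: n for res, n in counts.items() if n >= threshold}
```

Feed it a captured log file (`kubectl logs ... > ops.log`) split into lines, and compare runs a few minutes apart to see which resources keep churning.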

Database Connection Errors

Symptom: Products fail to start due to database errors

# Check database connectivity
kubectl logs -n posit-team <product-pod> | grep -i database

Solution: Verify database credentials in secrets and ensure network policies allow database access.

Getting Help

If you encounter issues not covered in this guide:

  1. Check Operator Logs:

    kubectl logs -n posit-team-system deployment/team-operator-controller-manager --tail=200
  2. Review GitHub Issues: Check existing issues

  3. Contact Support: Contact Posit for enterprise support

  4. Collect Diagnostic Information:

    # Create a diagnostic bundle
    kubectl get all -n posit-team-system -o yaml > diag-system.yaml
    kubectl get sites,workbenches,connects,packagemanagers -A -o yaml > diag-resources.yaml
    kubectl logs -n posit-team-system deployment/team-operator-controller-manager > diag-logs.txt

Related Documentation