Merged
2 changes: 1 addition & 1 deletion dist/chart/templates/certmanager/certificate.yaml
```diff
@@ -51,7 +51,7 @@ spec:
   dnsNames:
     - team-operator.{{ .Release.Namespace }}.svc
     - team-operator.{{ .Release.Namespace }}.svc.cluster.local
-    - team-operator-metrics-service.{{ .Release.Namespace }}.svc
+    - team-operator-controller-manager-metrics-service.{{ .Release.Namespace }}.svc
   issuerRef:
     kind: Issuer
     name: selfsigned-issuer
```
2 changes: 1 addition & 1 deletion dist/chart/templates/metrics/metrics-service.yaml
```diff
@@ -2,7 +2,7 @@
 apiVersion: v1
 kind: Service
 metadata:
-  name: team-operator-controller-manager-metrics-service
+  name: {{ .Values.controllerManager.serviceAccountName }}-metrics-service
   namespace: {{ .Release.Namespace }}
   labels:
     {{- include "chart.labels" . | nindent 4 }}
```
17 changes: 0 additions & 17 deletions dist/chart/templates/rbac/auth_proxy_service.yaml

This file was deleted.

95 changes: 95 additions & 0 deletions docs/guides/troubleshooting.md
@@ -6,6 +6,7 @@ This comprehensive guide covers common issues and their solutions when running P

1. [General Debugging](#general-debugging)
2. [Operator Issues](#operator-issues)
   - [Operator Pod Stuck in Pending (Scheduling Failures)](#operator-pod-stuck-in-pending-scheduling-failures)
3. [Site Reconciliation Issues](#site-reconciliation-issues)
4. [Database Issues](#database-issues)
5. [Product-Specific Issues](#product-specific-issues)
@@ -203,6 +204,100 @@ kubectl describe crd sites.core.posit.team
helm upgrade --install team-operator ./dist/chart --set installCRDs=true
```

### Operator Pod Stuck in Pending (Scheduling Failures)

**Symptoms:**
- Operator pod stays in `Pending` state indefinitely
- `kubectl describe pod` shows taint-related scheduling errors
- Events contain messages like `node(s) had taints that the pod didn't tolerate`

**Diagnosis:**
```bash
# Check operator pod status
kubectl get pods -n posit-team-system

# Describe the pod to see scheduling failures
kubectl describe pod -n posit-team-system -l control-plane=controller-manager

# List node taints to understand what tolerations are needed
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```

**Cause:**

Kubernetes nodes can have [taints](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) that prevent pods from scheduling unless the pod has a matching toleration. Common scenarios include:

- Dedicated node pools for specific workloads (e.g., GPU nodes, session nodes)
- Nodes reserved for critical system components
- Cloud provider managed node pools with default taints

If the operator pod doesn't have tolerations matching the node taints, it will remain in `Pending` state.
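As a sketch, this is what a taint looks like in the node spec (the `workload-type=session` taint here is a hypothetical example, set with something like `kubectl taint nodes <node> workload-type=session:NoSchedule`):

```yaml
# Excerpt from a node object; pods without a matching
# toleration cannot be scheduled onto this node.
spec:
  taints:
    - key: "workload-type"
      value: "session"
      effect: "NoSchedule"
```

A pod tolerates this taint only if it carries a toleration whose `key`, `effect`, and either `value` (with `operator: Equal`) or `operator: Exists` all match.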

**Solution:**

Configure tolerations in your Helm values to match your cluster's node taints:

```yaml
# values.yaml
controllerManager:
  tolerations:
    # Example: Tolerate nodes tainted for session workloads
    - key: "workload-type"
      operator: "Equal"
      value: "session"
      effect: "NoSchedule"

    # Example: Tolerate nodes with GPU taints
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"

    # Example: Tolerate all taints (use with caution)
    # - operator: "Exists"
```

Apply the configuration:

```bash
helm upgrade team-operator ./dist/chart \
  --namespace posit-team-system \
  -f values.yaml
```

**Common Toleration Patterns:**

| Scenario | Toleration Configuration |
|----------|-------------------------|
| Session-dedicated nodes | `key: "workload-type", value: "session", effect: "NoSchedule"` |
| GPU nodes | `key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule"` |
| Cloud provider taints (EKS) | `key: "eks.amazonaws.com/compute-type", operator: "Exists"` |
| Cloud provider taints (GKE) | `key: "cloud.google.com/gke-nodepool", operator: "Exists"` |
| Control plane nodes | `key: "node-role.kubernetes.io/control-plane", operator: "Exists"` |
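For example, the EKS row above corresponds to the following `values.yaml` entry; omitting `effect` makes the toleration match taints with any effect. This assumes, as in the examples above, that the chart forwards `controllerManager.tolerations` into the Deployment's pod spec:

```yaml
controllerManager:
  tolerations:
    - key: "eks.amazonaws.com/compute-type"
      operator: "Exists"
```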

**Using nodeSelector as an alternative:**

To pin the operator to specific nodes, use `nodeSelector`. Note that `nodeSelector` only restricts which nodes are eligible; it does not bypass taints, so if the selected nodes are tainted you still need matching tolerations:

```yaml
controllerManager:
  nodeSelector:
    kubernetes.io/os: linux
    node-type: management
```
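Because dedicated node pools are typically both labeled and tainted, a common pattern is to combine `nodeSelector` with a matching toleration (the `node-type=management` label and taint here are hypothetical, standing in for whatever your cluster uses):

```yaml
controllerManager:
  # Restrict scheduling to labeled management nodes...
  nodeSelector:
    node-type: management
  # ...and tolerate the taint those nodes carry.
  tolerations:
    - key: "node-type"
      operator: "Equal"
      value: "management"
      effect: "NoSchedule"
```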

**Verification:**

After applying tolerations, verify the pod schedules successfully:

```bash
# Check pod is running
kubectl get pods -n posit-team-system

# Verify tolerations were applied
kubectl get deployment team-operator-controller-manager -n posit-team-system \
  -o jsonpath='{.spec.template.spec.tolerations}' | jq
```

---

## Site Reconciliation Issues