Troubleshooting Guide

This document covers solutions to common issues, debugging commands, error messages, and performance problems.

Table of Contents

Common Issues
Debugging Commands
Error Messages
Performance Issues
Incident References

Common Issues

Issue 1: API Endpoints Must Not Block

Symptom: API endpoint takes 30+ seconds to respond

Cause: The endpoint executes K8s operations directly instead of leaving them to reconciliation

Solution: The endpoint should only update the database and return immediately; reconciliation then performs the K8s operations

// ❌ BAD (blocking)
export async function POST(req: Request) {
  await k8sService.createSandbox() // Blocks for 30s
  return NextResponse.json({ success: true })
}

// ✅ GOOD (non-blocking)
export async function POST(req: Request) {
  await prisma.sandbox.create({
    data: { status: 'CREATING', /* ... */ }
  })
  // Reconciliation will handle K8s operations
  return NextResponse.json({ success: true })
}

Issue 2: Always Use getK8sServiceForUser()

Symptom: "User does not have KUBECONFIG configured"

Cause: Code uses the global K8s service instead of a user-specific one

Solution: Always call getK8sServiceForUser(), which loads the user's kubeconfig from the UserConfig table

// ❌ BAD (old pattern)
const k8sService = new KubernetesService()

// ✅ GOOD (v0.4.0+)
const k8sService = await getK8sServiceForUser(userId)

Issue 3: Optimistic Locking Prevents Concurrent Updates

Symptom: Reconciliation job skips some records

Cause: Multiple instances or rapid cycles try to process the same records

Solution: This is expected behavior. Optimistic locking ensures a single writer per record; a skipped record becomes eligible again once its lockedUntil has passed

// Repository layer automatically handles locking
const lockedSandboxes = await acquireAndLockSandboxes(10)
// Only returns sandboxes where lockedUntil IS NULL OR < NOW()
// Sets lockedUntil = NOW() + 30 seconds atomically
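
The locking rule described in the comments above can be illustrated with a small in-memory sketch. This is not the repository implementation: `SandboxRow` and `acquireAndLock` are hypothetical names, and a real implementation would do the select and the update atomically in the database. The sketch only shows why a second caller skips records that a first caller has just locked.

```typescript
// Hypothetical in-memory model of the lock column; not the real schema.
interface SandboxRow {
  id: string
  lockedUntil: Date | null
}

const LOCK_MS = 30000 // 30-second lock window, as in the comments above

// Returns up to `limit` rows whose lock is absent or expired, and locks them.
function acquireAndLock(rows: SandboxRow[], limit: number, now: Date): SandboxRow[] {
  const available = rows
    .filter((r) => r.lockedUntil === null || r.lockedUntil.getTime() < now.getTime())
    .slice(0, limit)
  for (const row of available) {
    row.lockedUntil = new Date(now.getTime() + LOCK_MS)
  }
  return available
}

const rows: SandboxRow[] = [
  { id: 'a', lockedUntil: null },
  { id: 'b', lockedUntil: null },
]
const t0 = new Date('2024-01-01T00:00:00Z')
// First cycle locks both rows.
const firstCycle = acquireAndLock(rows, 10, t0)
// A cycle one second later skips them because the locks are still valid.
const secondCycle = acquireAndLock(rows, 10, new Date(t0.getTime() + 1000))
// After the 30-second window the rows are eligible again.
const thirdCycle = acquireAndLock(rows, 10, new Date(t0.getTime() + 31000))
```

A cycle that returns fewer records than expected is therefore not an error by itself; the skipped records are held by another cycle or instance.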

Issue 4: Status Aggregation Rules

Symptom: Project shows PARTIAL status unexpectedly

Cause: Child resources in inconsistent states

Solution: Understand aggregation priority rules

Priority order:

1. ERROR - At least one resource has ERROR
2. CREATING - At least one resource has CREATING
3. UPDATING - At least one resource has UPDATING
4. Pure states - All same status → use that status
5. Transition states:
   - STARTING: All ∈ {RUNNING, STARTING}
   - STOPPING: All ∈ {STOPPED, STOPPING}
   - TERMINATING: All ∈ {TERMINATED, TERMINATING}
6. PARTIAL - Inconsistent mixed states
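
To check why a project reports PARTIAL, it can help to run the child statuses through the priority order by hand. The following is a minimal sketch of those rules, assuming the status names listed above; `aggregateStatus` is a hypothetical helper, not the project's actual function, and the handling of an empty list is an assumption because the rules do not cover it.

```typescript
type ResourceStatus =
  | 'ERROR' | 'CREATING' | 'UPDATING'
  | 'RUNNING' | 'STARTING'
  | 'STOPPED' | 'STOPPING'
  | 'TERMINATED' | 'TERMINATING'

type AggregatedStatus = ResourceStatus | 'PARTIAL'

// Hypothetical helper that applies the priority order listed above.
function aggregateStatus(statuses: ResourceStatus[]): AggregatedStatus {
  if (statuses.length === 0) {
    // Assumption: the rules do not say what a project without children reports.
    throw new Error('aggregateStatus requires at least one child status')
  }
  // Highest priority: ERROR, then CREATING, then UPDATING, if any child has it.
  for (const s of ['ERROR', 'CREATING', 'UPDATING'] as const) {
    if (statuses.some((x) => x === s)) return s
  }
  // Pure state: every child reports the same status.
  if (statuses.every((s) => s === statuses[0])) return statuses[0]
  // Transition states: every child is in the target state or moving toward it.
  const allIn = (...allowed: ResourceStatus[]) =>
    statuses.every((s) => allowed.indexOf(s) !== -1)
  if (allIn('RUNNING', 'STARTING')) return 'STARTING'
  if (allIn('STOPPED', 'STOPPING')) return 'STOPPING'
  if (allIn('TERMINATED', 'TERMINATING')) return 'TERMINATING'
  // Fallback: any other mix is inconsistent.
  return 'PARTIAL'
}
```

For example, `['RUNNING', 'STARTING']` aggregates to STARTING, while `['RUNNING', 'STOPPED']` matches no transition set and falls through to PARTIAL.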

Issue 5: ttyd Authentication Failed

Symptom: Terminal shows "Authentication failed"

Cause: TTYD_ACCESS_TOKEN is missing or incorrect

Solution: Check the environment variable inside the sandbox pod, then check the URL format

# In sandbox pod
echo "$TTYD_ACCESS_TOKEN"

# Check URL format
# Should be: https://{domain}?authorization={base64(user:token)}
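
The URL format above can be built and inspected with a short sketch. `buildTtydUrl` and `decodeTtydAuthorization` are hypothetical helpers written for illustration, and the sketch assumes a Node.js runtime (for `Buffer`). Decoding the parameter of a failing URL shows which user and token it actually carries, which can then be compared with the value of TTYD_ACCESS_TOKEN in the pod.

```typescript
// Hypothetical helper; the URL shape follows the format comment above.
function buildTtydUrl(domain: string, user: string, token: string): string {
  const credentials = Buffer.from(`${user}:${token}`, 'utf8').toString('base64')
  // Assumption: the value is appended as-is. Whether base64 characters such as
  // "+", "/" or "=" need percent-encoding depends on how the deployment reads
  // the parameter, which this guide does not state.
  return `https://${domain}?authorization=${credentials}`
}

// Decodes the authorization parameter of an existing URL back to "user:token".
function decodeTtydAuthorization(url: string): string | null {
  const marker = 'authorization='
  const index = url.indexOf(marker)
  if (index === -1) return null
  const value = url.slice(index + marker.length).split('&')[0]
  return Buffer.from(value, 'base64').toString('utf8')
}

// Placeholder values for illustration only.
const exampleUrl = buildTtydUrl('sandbox.example.com', 'user', 'secret')
const decodedCredentials = decodeTtydAuthorization(exampleUrl)
```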

Issue 6: FileBrowser Login Failed

Symptom: FileBrowser shows "Invalid credentials"

Cause: FILE_BROWSER_USERNAME or FILE_BROWSER_PASSWORD is missing or incorrect

Solution: Check both environment variables inside the sandbox pod

# In sandbox pod
echo "$FILE_BROWSER_USERNAME"
echo "$FILE_BROWSER_PASSWORD"

Issue 7: Database Connection Failed

Symptom: Sandbox can't connect to PostgreSQL

Cause: The database is not ready, or the connection string is wrong

Solution: Check the database status and the connection URL

# Check database status
kubectl get cluster -n {namespace}

# Check connection URL
echo "$DATABASE_URL"

# Test connection
psql "$DATABASE_URL"
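
When the connection string is suspected to be wrong, it can be easier to inspect its parts than to read the raw URL, and doing so avoids printing the password into a terminal or log. The sketch below uses the standard `URL` class; `describeDatabaseUrl` is a hypothetical helper, and the sample URL is made up for illustration.

```typescript
// Hypothetical helper: summarizes a connection URL without exposing the password.
function describeDatabaseUrl(raw: string): string {
  const url = new URL(raw) // throws if the string is not a valid URL
  const database = url.pathname.replace(/^\//, '')
  const password = url.password ? '***' : '(none)'
  const port = url.port || '(default)'
  return `protocol=${url.protocol} user=${url.username} password=${password} ` +
    `host=${url.hostname} port=${port} database=${database}`
}

// Sample value for illustration; in a sandbox, pass process.env.DATABASE_URL instead.
const summary = describeDatabaseUrl('postgresql://app:s3cret@db.example.com:5432/appdb')
console.log(summary)
```

A thrown error, an unexpected host, or an empty database name each point to a malformed DATABASE_URL rather than to a database that is not ready.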

Debugging Commands

Kubernetes Resources

# Set kubeconfig
export KUBECONFIG=/path/to/kubeconfig

# Check StatefulSets
kubectl get statefulsets -n {namespace} | grep {project-name}

# Check pods
kubectl get pods -n {namespace} -l app={statefulset-name}

# Pod logs
kubectl logs -n {namespace} {pod-name}

# Pod logs (follow)
kubectl logs -f -n {namespace} {pod-name}

# Check KubeBlocks database cluster
kubectl get cluster -n {namespace} | grep {project-name}

# Get database credentials
kubectl get secret -n {namespace} {cluster-name}-conn-credential -o yaml

# Check ingresses
kubectl get ingress -n {namespace} | grep {project-name}

# Describe resource for events
kubectl describe statefulset -n {namespace} {statefulset-name}
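
The secret printed by the credentials command above is not directly readable, because Kubernetes stores the values under `data` base64-encoded. A minimal sketch for decoding them follows, assuming a Node.js runtime and that the secret was fetched with `-o json` instead of `-o yaml` so it can be parsed as JSON. `decodeSecretData` is a hypothetical helper, and the key names in the sample are placeholders rather than the keys the cluster secret is known to contain.

```typescript
// Decodes every value in a Secret's `data` map from base64 to plain text.
function decodeSecretData(data: Record<string, string>): Record<string, string> {
  const decoded: Record<string, string> = {}
  for (const key of Object.keys(data)) {
    decoded[key] = Buffer.from(data[key], 'base64').toString('utf8')
  }
  return decoded
}

// Sample secret shaped like `kubectl get secret ... -o json` output.
// The keys and values here are placeholders for illustration.
const sampleSecret = {
  data: {
    username: Buffer.from('postgres', 'utf8').toString('base64'),
    password: Buffer.from('example-password', 'utf8').toString('base64'),
  },
}
const credentials = decodeSecretData(sampleSecret.data)
```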

Database Queries

# Open Prisma Studio
npx prisma studio

# Direct PostgreSQL queries
psql "$DATABASE_URL"

# Check locked resources
psql "$DATABASE_URL" -c 'SELECT id, status, "lockedUntil" FROM "Sandbox" WHERE "lockedUntil" IS NOT NULL;'

Application Logs

# Main application logs
kubectl logs -n {namespace} {pod-name}

# Filter by module
kubectl logs -n {namespace} {pod-name} | grep "lib/events/sandbox"

# Filter by level
kubectl logs -n {namespace} {pod-name} | grep "ERROR"

Error Messages

"User does not have KUBECONFIG configured"

Cause: User hasn't uploaded kubeconfig

Solution:

Check UserConfig table for KUBECONFIG key
User needs to configure kubeconfig via UI or API

"Project not found"

Cause: Project doesn't exist or user doesn't have access

Solution:

Check project ID
Check user ID matches project owner
Check namespace matches user's kubeconfig

"Cannot start project - invalid status transition"

Cause: Project not in correct state for start

Solution:

Check current project status
Only STOPPED projects can be started
Wait for current operation to complete

"Environment variables can only be updated when the project is running"

Cause: Project not in RUNNING state

Solution:

Check project status
Start project first
Wait for RUNNING status
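
This error and the previous one both come from a status precondition: starting requires STOPPED, and updating environment variables requires RUNNING. The sketch below expresses those two rules as a lookup. `ProjectAction`, `REQUIRED_STATUS`, and `checkPrecondition` are hypothetical names, not the project's API, and the message text is illustrative.

```typescript
// Hypothetical names; only the two rules stated in this section are modeled.
type ProjectAction = 'start' | 'updateEnv'

const REQUIRED_STATUS: Record<ProjectAction, string> = {
  start: 'STOPPED', // only STOPPED projects can be started
  updateEnv: 'RUNNING', // env vars can only be updated while RUNNING
}

// Returns null when the action is allowed, otherwise an explanation.
function checkPrecondition(action: ProjectAction, currentStatus: string): string | null {
  const required = REQUIRED_STATUS[action]
  if (currentStatus === required) return null
  return `Cannot ${action}: project is ${currentStatus}, expected ${required}`
}

const startWhileStarting = checkPrecondition('start', 'STARTING')
const updateWhileRunning = checkPrecondition('updateEnv', 'RUNNING')
```

A non-null result for a project in a transitional status such as STARTING usually means the request should be retried after the current operation completes.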

"Failed to create sandbox"

Cause: Various K8s errors

Solution:

Check K8s events: kubectl describe statefulset
Check resource quotas
Check image availability
Check namespace exists

Performance Issues

Slow API Responses

Possible Causes:

Database query performance
Missing indexes
N+1 query problem

Solutions:

// Use include for relations
await prisma.project.findMany({
  include: { sandboxes: true, databases: true }
})

// Check query performance
const prisma = new PrismaClient({
  log: ['query', 'info', 'warn', 'error'],
})

Slow Reconciliation

Possible Causes:

Too many resources to process
K8s API throttling
Lock contention

Solutions:

Reduce batch size in reconciliation job
Increase reconciliation interval
Check K8s API server load

High Memory Usage

Possible Causes:

Memory leaks in event listeners
Large objects in memory
Unclosed connections

Solutions:

Check for memory leaks
Use streaming for large data
Close connections properly

Slow Sandbox Startup

Possible Causes:

Large runtime image
Slow PVC provisioning
Resource constraints

Solutions:

Use smaller base image
Check storage class performance
Increase resource limits

Incident References

- GitHub import delay postmortem:
  - [project-import-delay-postmortem.md](/Users/che/Documents/GitHub/fulling/docs/project-import-delay-postmortem.md)
