Rollback Procedures

Overview

When a deployment goes wrong, you need to quickly restore service. This guide covers rollback strategies for the DineTogether infrastructure.

Quick Rollback Commands

Immediate Rollback (Last Known Good)

# Rollback to previous version
kubectl rollout undo deployment/myapp -n test-staging

# Check rollback status
kubectl rollout status deployment/myapp -n test-staging

Rollback to Specific Version

# View rollout history
kubectl rollout history deployment/myapp -n test-staging

# Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=5 -n test-staging
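
When this step is scripted, the target revision can be read from the history output rather than hard-coded. The following is a minimal sketch, assuming the usual two-column `REVISION  CHANGE-CAUSE` layout that `kubectl rollout history` prints; `previous_revision` is a hypothetical helper name, not part of kubectl.

```shell
# Hypothetical helper: read `kubectl rollout history` output on stdin and
# print the second-highest revision number (the revision a plain
# `rollout undo` returns to). Exits non-zero if fewer than two are listed.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ { print $1 }' \
    | sort -n \
    | awk '{ prev = last; last = $1 }
           END { if (prev == "") exit 1; print prev }'
}

# Example use (not run here):
#   rev=$(kubectl rollout history deployment/myapp -n test-staging | previous_revision)
#   kubectl rollout undo deployment/myapp --to-revision="$rev" -n test-staging
```

Pinning the revision explicitly makes the command safe to repeat: running a bare `rollout undo` twice flips back to the bad revision.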

Finding the Right Version

Check Deployment History

# List all revisions
kubectl rollout history deployment/myapp -n test-staging

# View specific revision details
kubectl rollout history deployment/myapp -n test-staging --revision=5

Find Working Image Tags

# List recent images from GitHub
gh api /orgs/dine-together/packages/container/myapp/versions \
  --jq '.[0:10] | .[] | {tag: .metadata.container.tags[], created: .created_at}'

# Check what's currently running
kubectl get deployment myapp -n test-staging -o jsonpath='{.spec.template.spec.containers[0].image}'
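
To compare the running image against the tags listed from the registry, it helps to isolate the tag from the full image reference. A small sketch in plain POSIX shell; `image_tag` is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper: print the tag portion of an image reference.
# Prints "latest" when no tag is present (the default a runtime assumes).
# Returns non-zero for digest-pinned references, which carry no tag.
image_tag() {
  case "$1" in *@*) return 1 ;; esac
  name="${1##*/}"   # keep the last path segment so a registry port is ignored
  case "$name" in
    *:*) printf '%s\n' "${name##*:}" ;;
    *)   printf 'latest\n' ;;
  esac
}

# Example use (not run here):
#   current=$(kubectl get deployment myapp -n test-staging \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   image_tag "$current"
```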

Rollback Strategies

Strategy 1: Kubernetes Native Rollback

Best for: Recent deployments, configuration changes

# Rollback deployment
kubectl rollout undo deployment/myapp -n test-staging

# Verify pods are updating
kubectl get pods -n test-staging -w

Strategy 2: Manual Image Update

Best for: Specific version needed, cross-environment rollback

# Update to specific image
kubectl set image deployment/myapp \
  myapp=ghcr.io/dine-together/myapp:abc123def \
  -n test-staging

# set image already starts a rolling update; a restart is only needed
# if the image reference itself did not change
kubectl rollout restart deployment/myapp -n test-staging

Strategy 3: Git Revert

Best for: Complex changes, maintaining history

# Revert the problematic commit (HEAD here; pass its SHA instead if it is not the latest commit)
git revert HEAD
git push origin main

# This triggers new deployment with reverted code

Strategy 4: Emergency Replace

Best for: Corrupted deployment, complete failure

# Delete current deployment (all pods stop, so expect downtime until the apply below completes)
kubectl delete deployment myapp -n test-staging

# Apply known good configuration
kubectl apply -f backup/myapp-deployment.yaml -n test-staging

Pre-Rollback Checklist

  1. Identify the Issue

    # Check pod status
    kubectl get pods -n test-staging
    
    # Check recent events
    kubectl get events -n test-staging --sort-by='.lastTimestamp'
    
    # View logs
    kubectl logs deployment/myapp -n test-staging
    

  2. Capture Current State

    # Save current deployment
    kubectl get deployment myapp -n test-staging -o yaml > myapp-current.yaml
    
    # Record problematic image
    kubectl get deployment myapp -n test-staging -o jsonpath='{.spec.template.spec.containers[0].image}'
    

  3. Notify Team

    - Alert about rollback
    - Document issue
    - Create incident ticket
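
The inspection and capture commands in steps 1 and 2 can be bundled so that nothing is forgotten under pressure. This is a sketch only: `capture_state` and the file names it writes are hypothetical, and it assumes the `myapp` container is the first one in the pod template, as the commands above do.

```shell
# Hypothetical helper: save rollback evidence for one deployment.
# Usage: capture_state <namespace> <deployment> <output-dir>
capture_state() {
  ns="$1"; deploy="$2"; out="$3"
  mkdir -p "$out" || return 1
  kubectl get deployment "$deploy" -n "$ns" -o yaml > "$out/deployment.yaml" || return 1
  kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[0].image}' > "$out/image.txt" || return 1
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' > "$out/events.txt" || return 1
  # Logs can be missing if no container ever started, so do not fail on them.
  kubectl logs "deployment/$deploy" -n "$ns" > "$out/logs.txt" 2>&1 || true
}

# Example use (not run here):
#   capture_state test-staging myapp "incident-$(date +%Y%m%d-%H%M%S)"
```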

Rollback Scenarios

Scenario 1: Application Won't Start

Symptoms: CrashLoopBackOff, restart loops

# Quick rollback
kubectl rollout undo deployment/myapp -n test-staging

# If that doesn't work, try previous image
kubectl set image deployment/myapp \
  myapp=ghcr.io/dine-together/myapp:previous-sha \
  -n test-staging
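
To judge whether the first rollback was enough, check whether any pods are still crash-looping before reaching for the manual image update. A minimal sketch, assuming the default `kubectl get pods` column layout (NAME, READY, STATUS, ...); `crashlooping_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# names of pods whose STATUS column is CrashLoopBackOff.
crashlooping_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Example use (not run here):
#   kubectl get pods -n test-staging -l app=myapp | crashlooping_pods
```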

Scenario 2: Bad Configuration

Symptoms: Wrong environment variables, missing secrets

# Rollback deployment (restores the pod template, including env vars;
# the contents of ConfigMaps and Secrets are not reverted)
kubectl rollout undo deployment/myapp -n test-staging

# Or update specific config
kubectl set env deployment/myapp \
  API_URL=https://api.test.dinetogether.co.uk \
  -n test-staging

Scenario 3: Performance Issues

Symptoms: High CPU/memory, slow responses

# Rollback first
kubectl rollout undo deployment/myapp -n test-staging

# Then investigate
kubectl top pods -n test-staging
kubectl describe pod <pod-name> -n test-staging

Scenario 4: Breaking Changes

Symptoms: API incompatibility, frontend/backend mismatch

# Rollback both services
kubectl rollout undo deployment/frontend -n test-staging
kubectl rollout undo deployment/backend -n test-staging

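# Ensure compatible versions: confirm both deployments now run the expected images
kubectl get deployment frontend backend -n test-staging -o wide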

Verification Steps

1. Check Rollback Progress

# Watch rollout status
kubectl rollout status deployment/myapp -n test-staging -w

# Monitor pods
kubectl get pods -n test-staging -w -l app=myapp

2. Verify Application Health

# Check endpoints
kubectl get endpoints myapp -n test-staging

# Test internally (run in the same namespace so the short service name resolves)
kubectl run test --rm -it --restart=Never --image=busybox -n test-staging -- wget -O- http://myapp

# Check external access
curl https://myapp.test.dinetogether.co.uk/health
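
Right after a rollback the endpoint may fail for a short while as pods come up, so a single `curl` can give a false alarm. The sketch below polls instead; `wait_healthy` and its default attempt count and delay are assumptions chosen for illustration.

```shell
# Hypothetical helper: poll a health URL until curl reports no HTTP error.
# Usage: wait_healthy <url> [attempts] [delay-seconds]
wait_healthy() {
  url="$1"; attempts="${2:-10}"; delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example use (not run here):
#   wait_healthy https://myapp.test.dinetogether.co.uk/health 12 5
```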

3. Monitor Logs

# Stream logs
kubectl logs -f deployment/myapp -n test-staging

# Check for errors
kubectl logs deployment/myapp -n test-staging | grep -i error

Preventing Future Issues

1. Backup Configurations

# Before deployment
kubectl get deployment myapp -n test-staging -o yaml > backups/myapp-$(date +%Y%m%d).yaml
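
Because the backup names end in a `YYYYMMDD` date, the newest file can be found by plain text ordering. A minimal sketch; `latest_backup` is a hypothetical helper, and the newest backup is not necessarily a known-good one, so check it before applying.

```shell
# Hypothetical helper: print the newest dated backup for an app.
# Usage: latest_backup <backup-dir> <app-name>
latest_backup() {
  dir="$1"; app="$2"
  newest=""
  for f in "$dir/$app"-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].yaml; do
    [ -e "$f" ] || continue   # the pattern stays literal when nothing matches
    newest="$f"               # globs expand in sorted order, so the last one wins
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Example use (not run here):
#   kubectl apply -f "$(latest_backup backups myapp)" -n test-staging
```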

2. Test in Staging First

# Use separate environments
namespaces:
  - test-staging     # Test here first
  - test-production  # Then deploy here

3. Gradual Rollouts

# In docker-compose.yml
deploy:
  replicas: 3
  update_config:
    parallelism: 1  # Update one at a time
    delay: 10s      # Wait between updates
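
The `update_config` keys above are Docker Compose (Swarm) settings; Deployments managed with kubectl do not read them. A rough Kubernetes counterpart is the Deployment's rolling update strategy. The sketch below is an assumption about the intent (one pod at a time, with a pause), and `set_gradual_rollout` is a hypothetical helper name.

```shell
# Hypothetical helper: make a Deployment replace pods one at a time.
# maxSurge=1 with maxUnavailable=0 starts one new pod before removing an old
# one; minReadySeconds requires each new pod to stay ready for that long.
# Usage: set_gradual_rollout <namespace> <deployment> [min-ready-seconds]
set_gradual_rollout() {
  ns="$1"; deploy="$2"; min_ready="${3:-10}"
  patch=$(printf '{"spec":{"minReadySeconds":%d,"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":1,"maxUnavailable":0}}}}' "$min_ready")
  kubectl patch deployment "$deploy" -n "$ns" --type=merge -p "$patch"
}

# Example use (not run here):
#   set_gradual_rollout test-staging myapp 10
```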

4. Health Checks

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
  interval: 30s
  timeout: 10s
  retries: 3

Emergency Procedures

Complete Service Failure

  1. Switch to maintenance mode

    # Deploy maintenance page
    kubectl apply -f emergency/maintenance.yaml -n test-staging
    

  2. Restore from backup

    # Use last known good configuration
    kubectl apply -f backups/myapp-20240115.yaml -n test-staging
    

  3. Verify restoration

    kubectl rollout status deployment/myapp -n test-staging
    kubectl get pods -n test-staging
    

Database Corruption

  1. Stop application

    kubectl scale deployment/myapp --replicas=0 -n test-staging
    

  2. Restore database

    # Connect to database pod
    kubectl exec -it postgres-0 -n test-staging -- /bin/bash
    
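    # Restore from backup (pg_restore reads custom-format archives; if the file
    # is a plain SQL dump, use: psql -d myapp -f /backups/myapp-20240115.sql)
    pg_restore -d myapp /backups/myapp-20240115.sql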
    

  3. Restart application

    kubectl scale deployment/myapp --replicas=3 -n test-staging
    

Post-Rollback Actions

  1. Document the Incident

    - What went wrong
    - Rollback steps taken
    - Time to recovery
    - Root cause

  2. Update Monitoring

    - Add alerts for the issue
    - Update health checks
    - Review dashboards

  3. Fix Forward

    - Create fix branch
    - Test thoroughly
    - Deploy with confidence

  4. Update Runbooks

    - Document new procedures
    - Update emergency contacts
    - Review rollback process

Rollback Commands Reference

# Basic rollback
kubectl rollout undo deployment/myapp -n test-staging

# Specific revision
kubectl rollout undo deployment/myapp --to-revision=5 -n test-staging

# Update image
kubectl set image deployment/myapp myapp=ghcr.io/dine-together/myapp:tag -n test-staging

# Scale to zero (stop)
kubectl scale deployment/myapp --replicas=0 -n test-staging

# Scale back up
kubectl scale deployment/myapp --replicas=3 -n test-staging

# Delete and recreate
kubectl delete deployment myapp -n test-staging
kubectl apply -f myapp-deployment.yaml -n test-staging

# Emergency restart
kubectl rollout restart deployment/myapp -n test-staging

Next Steps