Rollback Procedures

Overview

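When a deployment goes wrong, the priority is to restore service quickly. This guide covers rollback strategies for the DineTogether infrastructure. The examples use a deployment named myapp in the test-staging namespace; substitute your own deployment and namespace.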

Quick Rollback Commands

Immediate Rollback (Last Known Good)

# Roll back to the previous revision (not necessarily the last known good one)
kubectl rollout undo deployment/myapp -n test-staging

# Check rollback status
kubectl rollout status deployment/myapp -n test-staging

Rollback to Specific Version

# View rollout history
kubectl rollout history deployment/myapp -n test-staging

# Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=5 -n test-staging

Finding the Right Version

Check Deployment History

# List all revisions
kubectl rollout history deployment/myapp -n test-staging

# View specific revision details
kubectl rollout history deployment/myapp -n test-staging --revision=5
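Choosing a value for --to-revision can be scripted. The bash function below is a minimal sketch (previous_revision is a hypothetical name): it keeps only the numeric rows of the REVISION / CHANGE-CAUSE table that kubectl rollout history prints and returns the second-highest number, which is the revision a plain rollout undo would return to.

```shell
# previous_revision DEPLOYMENT NAMESPACE  (hypothetical helper)
# Prints the revision just before the newest one. Revision numbers can have
# gaps: after an undo, the restored revision is renumbered as the newest.
previous_revision() {
  local deploy="$1" ns="$2" revs
  # Keep only rows whose first column is a number (drops the header lines)
  revs=$(kubectl rollout history "deployment/${deploy}" -n "$ns" \
    | awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -n)
  # A "previous" revision only exists if there are at least two
  if [ "$(printf '%s\n' "$revs" | grep -c .)" -lt 2 ]; then
    echo "no previous revision for ${deploy}" >&2
    return 1
  fi
  printf '%s\n' "$revs" | tail -n 2 | head -n 1
}
```

Usage: kubectl rollout undo deployment/myapp --to-revision="$(previous_revision myapp test-staging)" -n test-staging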

Find Working Image Tags

# List recent images from GitHub
gh api /orgs/dine-together/packages/container/myapp/versions \
  --jq '.[0:10] | .[] | {tag: .metadata.container.tags[], created: .created_at}'

# Check what's currently running
kubectl get deployment myapp -n test-staging -o jsonpath='{.spec.template.spec.containers[0].image}'
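Comparing the running image with the tags in the GitHub list only needs the tag part of the reference. A minimal bash sketch (image_tag is a hypothetical helper): it ignores a registry port such as localhost:5000 and any @sha256 digest, and reports an untagged reference as latest, so digest-only references need separate handling.

```shell
# image_tag IMAGE_REF  (hypothetical helper)
#   ghcr.io/dine-together/myapp:abc123def -> abc123def
image_tag() {
  local ref="${1%%@*}"      # drop a trailing @sha256:... digest
  local name="${ref##*/}"   # last path segment, e.g. myapp:abc123def
  case "$name" in
    *:*) printf '%s\n' "${name##*:}" ;;
    *)   printf '%s\n' latest ;;   # no tag means the implicit "latest"
  esac
}
```

Usage: image_tag "$(kubectl get deployment myapp -n test-staging -o jsonpath='{.spec.template.spec.containers[0].image}')"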

Rollback Strategies

Strategy 1: Kubernetes Native Rollback

Best for: Recent deployments, configuration changes

# Rollback deployment
kubectl rollout undo deployment/myapp -n test-staging

# Verify pods are updating
kubectl get pods -n test-staging -w
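A native rollback is only finished once the new pods have rolled out, so it helps to wait with a timeout and compare the image before and after. A minimal sketch; rollback_and_verify is a hypothetical name, it assumes the first container in the pod template is the application, and the 300s default timeout is an assumption. An unchanged image is not necessarily a failure: the previous revision may differ only in environment variables or other pod-template settings.

```shell
# rollback_and_verify DEPLOYMENT NAMESPACE [TIMEOUT]  (hypothetical helper)
rollback_and_verify() {
  local deploy="$1" ns="$2" timeout="${3:-300s}" jp before after
  jp='{.spec.template.spec.containers[0].image}'
  before=$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath="$jp") || return 1
  kubectl rollout undo "deployment/${deploy}" -n "$ns" || return 1
  # --timeout makes the wait fail instead of blocking indefinitely
  kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout="$timeout" || return 1
  after=$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath="$jp") || return 1
  echo "before: ${before}"
  echo "after:  ${after}"
  if [ "$before" = "$after" ]; then
    echo "note: image unchanged; the previous revision may differ only in config" >&2
  fi
}
```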

Strategy 2: Manual Image Update

Best for: Specific version needed, cross-environment rollback

# Update to a specific image (the "myapp=" prefix is the container name)
kubectl set image deployment/myapp \
  myapp=ghcr.io/dine-together/myapp:abc123def \
  -n test-staging

# set image already starts a rolling update; wait for it rather than restarting
kubectl rollout status deployment/myapp -n test-staging
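Because a manual image update pins whatever reference you type, a small guard against untagged or :latest images avoids rolling back to a moving target. A sketch under assumptions: set_image_and_wait is a hypothetical helper, and the container name is passed explicitly (myapp in the examples above).

```shell
# set_image_and_wait DEPLOYMENT CONTAINER IMAGE NAMESPACE  (hypothetical helper)
set_image_and_wait() {
  local deploy="$1" container="$2" image="$3" ns="$4"
  case "${image##*/}" in
    *@sha256:*)   ;;   # pinned by digest: fine
    *:latest|*:)  echo "refusing mutable or empty tag: $image" >&2; return 2 ;;
    *:*)          ;;   # explicit tag: fine
    *)            echo "refusing untagged image: $image" >&2; return 2 ;;
  esac
  kubectl set image "deployment/${deploy}" "${container}=${image}" -n "$ns" || return 1
  # set image triggers the rolling update; just wait for it
  kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=300s
}
```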

Strategy 3: Git Revert

Best for: Complex changes, maintaining history

# Revert the problematic commit (HEAD is only correct if it is the latest commit;
# otherwise pass the bad commit's SHA)
git revert HEAD
git push origin main

# The push triggers a new deployment with the reverted code
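git revert HEAD undoes only the most recent commit. When the bad change is older, or is a merge commit (as with a merged pull request), it has to be reverted by SHA, and a merge needs -m 1 to say which parent's side to keep. A minimal sketch; revert_commit is a hypothetical helper, and -m 1 assumes the merge was made on main so that main is the first parent.

```shell
# revert_commit SHA  (hypothetical helper)
revert_commit() {
  local sha="$1" line
  # Prints the commit followed by its parents: 2 words normally, 3+ for a merge
  line=$(git rev-list --parents -n 1 "$sha") || return 1
  set -- $line   # intentional word splitting
  if [ "$#" -gt 2 ]; then
    # Merge commit: keep the mainline (first parent) side
    git revert --no-edit -m 1 "$sha"
  else
    git revert --no-edit "$sha"
  fi
}
```

Usage: revert_commit abc123def && git push origin main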

Strategy 4: Emergency Replace

Best for: Corrupted deployment, complete failure

# Delete current deployment (all pods stop; the service is down until the apply below is ready)
kubectl delete deployment myapp -n test-staging

# Apply known good configuration
kubectl apply -f backup/myapp-deployment.yaml -n test-staging
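A manifest saved with kubectl get -o yaml also contains fields the API server fills in, such as metadata.resourceVersion, metadata.uid and the status block. Once the deployment has been deleted, applying such a file can be rejected, because resourceVersion must not be set on an object being created. The awk filter below is a line-based sketch that assumes kubectl's default two-space YAML layout for a single object (clean_manifest is a hypothetical name); a YAML-aware tool such as yq is more robust.

```shell
# clean_manifest < exported.yaml > clean.yaml  (hypothetical helper)
# Not a YAML parser: relies on kubectl's two-space indentation.
clean_manifest() {
  awk '
    /^status:/ { skip = 1; next }   # drop the whole top-level status block
    /^[^ ]/    { skip = 0 }         # any other top-level key ends it
    skip       { next }
    /^  (resourceVersion|uid|creationTimestamp|generation|selfLink):/ { next }
    { print }
  '
}
```

Usage: clean_manifest < backup/myapp-deployment.yaml > /tmp/myapp-clean.yaml, then kubectl apply -f /tmp/myapp-clean.yaml -n test-staging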

Pre-Rollback Checklist

Identify the Issue

# Check pod status
kubectl get pods -n test-staging

# Check recent events
kubectl get events -n test-staging --sort-by='.lastTimestamp'

# View logs
kubectl logs deployment/myapp -n test-staging
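In a namespace with many pods, filtering the listing down to the problem pods saves time. A minimal sketch; unhealthy_pods is a hypothetical helper that reads the default kubectl get pods columns (NAME READY STATUS RESTARTS AGE) and prints pods that are not Running with every container ready, skipping Completed job pods.

```shell
# unhealthy_pods NAMESPACE  (hypothetical helper)
unhealthy_pods() {
  kubectl get pods -n "$1" --no-headers \
    | awk '{
        split($2, r, "/")            # READY column, e.g. 0/1
        if ($3 == "Completed") next  # finished job pods are fine
        if ($3 != "Running" || r[1] != r[2]) print $1, $2, $3
      }'
}
```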

Capture Current State

# Save current deployment
kubectl get deployment myapp -n test-staging -o yaml > myapp-current.yaml

# Record problematic image
kubectl get deployment myapp -n test-staging -o jsonpath='{.spec.template.spec.containers[0].image}'
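The commands above print to the terminal, where the output is easily lost once the rollback starts. A minimal sketch that writes the same evidence, plus rollout history and pod placement, into one directory for the incident ticket; capture_state and the incident-<timestamp> directory name are hypothetical.

```shell
# capture_state DEPLOYMENT NAMESPACE [DIR]  (hypothetical helper)
capture_state() {
  local deploy="$1" ns="$2"
  local dir="${3:-incident-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$dir" || return 1
  kubectl get deployment "$deploy" -n "$ns" -o yaml > "$dir/deployment.yaml"
  kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[0].image}' > "$dir/image.txt"
  kubectl rollout history "deployment/${deploy}" -n "$ns" > "$dir/history.txt"
  kubectl get pods -n "$ns" -o wide > "$dir/pods.txt"
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' > "$dir/events.txt"
  # Logs can be unavailable if no pod has started, so do not fail on them
  kubectl logs "deployment/${deploy}" -n "$ns" --all-containers > "$dir/logs.txt" 2>&1 || true
  echo "$dir"
}
```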

Notify Team
- Alert the team about the rollback
- Document the issue
- Create an incident ticket

Rollback Scenarios

Scenario 1: Application Won't Start

Symptoms: CrashLoopBackOff, restart loops

# Quick rollback
kubectl rollout undo deployment/myapp -n test-staging

# If the previous revision also fails, pin an earlier known-good image
kubectl set image deployment/myapp \
  myapp=ghcr.io/dine-together/myapp:<previous-sha> \
  -n test-staging
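For a CrashLoopBackOff, the useful error is usually in the log of the container instance that just crashed, which kubectl logs --previous shows. A minimal sketch; crashloop_logs is a hypothetical helper, the app=myapp selector is assumed from the label used elsewhere in this guide, and pods momentarily shown as Error between restarts are not listed.

```shell
# crashloop_logs NAMESPACE SELECTOR  (hypothetical helper)
crashloop_logs() {
  local ns="$1" selector="$2" pod
  for pod in $(kubectl get pods -n "$ns" -l "$selector" --no-headers \
                 | awk '$3 == "CrashLoopBackOff" { print $1 }'); do
    echo "=== ${pod} (previous container) ==="
    # --previous reads the log of the last terminated container instance
    kubectl logs "$pod" -n "$ns" --previous --tail=50 || true
  done
}
```

Usage: crashloop_logs test-staging app=myapp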

Scenario 2: Bad Configuration

Symptoms: Wrong environment variables, missing secrets

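# Roll back the deployment (reverts env vars and ConfigMap/Secret references
# in the pod template, but not the contents of the ConfigMaps or Secrets)
kubectl rollout undo deployment/myapp -n test-staging

# Or fix a single variable directly (this also starts a new rollout)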
kubectl set env deployment/myapp \
  API_URL=https://api.test.dinetogether.co.uk \
  -n test-staging
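kubectl rollout undo does not restore the contents of a ConfigMap or Secret, so a bad value stored there survives the rollback. One approach, sketched here under assumptions, is to re-apply a saved copy of the object and restart the deployment, since environment variables sourced from a ConfigMap are only read when a container starts. restore_config_and_restart and the backup file name in the usage line are hypothetical.

```shell
# restore_config_and_restart CONFIG_FILE DEPLOYMENT NAMESPACE  (hypothetical helper)
restore_config_and_restart() {
  local file="$1" deploy="$2" ns="$3"
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  kubectl apply -f "$file" -n "$ns" || return 1
  # New pods re-read env vars from the restored ConfigMap/Secret at startup
  kubectl rollout restart "deployment/${deploy}" -n "$ns" || return 1
  kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=300s
}
```

Usage: restore_config_and_restart backups/myapp-config.yaml myapp test-staging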

Scenario 3: Performance Issues

Symptoms: High CPU/memory, slow responses

# Rollback first
kubectl rollout undo deployment/myapp -n test-staging

# Then investigate
kubectl top pods -n test-staging
kubectl describe pod <pod-name> -n test-staging
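kubectl top (which needs metrics-server in the cluster) reports CPU in millicores, for example 250m. A minimal sketch that lists pods above a threshold, highest first; pods_over_cpu is a hypothetical helper and the threshold is yours to choose.

```shell
# pods_over_cpu NAMESPACE MILLICORES  (hypothetical helper)
pods_over_cpu() {
  kubectl top pods -n "$1" --no-headers \
    | awk -v limit="$2" '{
        cpu = $2
        if (cpu ~ /m$/) { sub(/m$/, "", cpu); cpu += 0 }  # "250m" -> 250
        else            { cpu = cpu * 1000 }              # "2" cores -> 2000
        if (cpu > limit) print cpu, $1
      }' \
    | sort -rn
}
```

Usage: pods_over_cpu test-staging 500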

Scenario 4: Breaking Changes

Symptoms: API incompatibility, frontend/backend mismatch

# Rollback both services
kubectl rollout undo deployment/frontend -n test-staging
kubectl rollout undo deployment/backend -n test-staging

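# Confirm the rolled-back frontend and backend images are a compatible pair
kubectl get deployment frontend backend -n test-staging \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'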
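Rolling back coupled services one at a time, waiting for each, leaves a window where the new frontend talks to the old backend (or the reverse). A minimal sketch that issues every undo first and only then waits for each rollout; rollback_together is a hypothetical helper. It stops at the first failure and does not re-apply deployments it has already rolled back.

```shell
# rollback_together NAMESPACE DEPLOYMENT...  (hypothetical helper)
rollback_together() {
  local ns="$1" d
  shift
  # Start all rollbacks first to shorten the version-mismatch window
  for d in "$@"; do
    kubectl rollout undo "deployment/${d}" -n "$ns" || return 1
  done
  for d in "$@"; do
    kubectl rollout status "deployment/${d}" -n "$ns" --timeout=300s || return 1
  done
}
```

Usage: rollback_together test-staging frontend backend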

Verification Steps

1. Check Rollback Progress

# Watch rollout status
kubectl rollout status deployment/myapp -n test-staging -w

# Monitor pods
kubectl get pods -n test-staging -w -l app=myapp

2. Verify Application Health

# Check endpoints
kubectl get endpoints myapp -n test-staging

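# Test from inside the cluster (same namespace, so the name myapp resolves to the Service)
kubectl run test --rm -it --restart=Never --image=busybox -n test-staging -- wget -qO- http://myapp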

# Check external access
curl https://myapp.test.dinetogether.co.uk/health
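Right after a rollback the endpoint may fail for a while as pods start, so a single curl can raise a false alarm. A minimal polling sketch; wait_for_health is a hypothetical helper, and the 30 attempts, 5-second delay and 5-second per-request timeout are assumptions.

```shell
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]  (hypothetical helper)
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-5}" i
  for ((i = 1; i <= attempts; i++)); do
    # -f: fail on HTTP status >= 400; -sS: quiet, but still show errors
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    # No point sleeping after the final attempt
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
  done
  echo "still unhealthy after ${attempts} attempts: ${url}" >&2
  return 1
}
```

Usage: wait_for_health https://myapp.test.dinetogether.co.uk/health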

3. Monitor Logs

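# Stream logs (deployment/myapp follows a single pod chosen by kubectl)
kubectl logs -f deployment/myapp -n test-staging

# Check for errors across every pod with the app=myapp label
kubectl logs -l app=myapp -n test-staging --tail=-1 --prefix | grep -i error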

Preventing Future Issues

1. Backup Configurations

# Before deployment (create the backups directory if it does not exist)
mkdir -p backups
kubectl get deployment myapp -n test-staging -o yaml > backups/myapp-$(date +%Y%m%d).yaml
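With a date-only file name, a second deployment on the same day overwrites the first backup, and the backups directory grows without limit. A minimal sketch that timestamps to the second and keeps only the newest N files; backup_deployment, the myapp-YYYYMMDD-HHMMSS.yaml naming and the default of 10 kept files are assumptions.

```shell
# backup_deployment DEPLOYMENT NAMESPACE [DIR] [KEEP]  (hypothetical helper)
backup_deployment() {
  local deploy="$1" ns="$2" dir="${3:-backups}" keep="${4:-10}" file
  mkdir -p "$dir" || return 1
  file="${dir}/${deploy}-$(date +%Y%m%d-%H%M%S).yaml"
  kubectl get deployment "$deploy" -n "$ns" -o yaml > "$file" || { rm -f "$file"; return 1; }
  # Timestamps sort lexically in time order (C locale), newest first here;
  # everything after the first KEEP entries is older and gets removed.
  printf '%s\n' "${dir}/${deploy}"-[0-9]*.yaml | LC_ALL=C sort -r \
    | tail -n +"$((keep + 1))" | while read -r old; do
        rm -f -- "$old"
      done
  echo "$file"
}
```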

2. Test in Staging First

# Use separate environments
namespaces:
  - test-staging     # Test here first
  - test-production  # Then deploy here

3. Gradual Rollouts

# In docker-compose.yml (Docker Compose/Swarm syntax; not applied to Kubernetes deployments)
deploy:
  replicas: 3
  update_config:
    parallelism: 1  # Update one at a time
    delay: 10s      # Wait between updates
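The deploy.update_config block above is read by Docker Swarm (docker stack deploy), not by Kubernetes. For a Kubernetes Deployment, roughly comparable settings are spec.strategy.rollingUpdate (maxSurge, maxUnavailable) and spec.minReadySeconds. A sketch applying them with kubectl patch; the values only loosely mirror the Compose example (one new pod at a time, about 10s between steps) and apply_gradual_rollout is a hypothetical name. These fields sit outside the pod template, so the patch does not start a rollout by itself.

```shell
# apply_gradual_rollout DEPLOYMENT NAMESPACE  (hypothetical helper)
#   maxSurge: 1          -> at most one extra pod is created at a time
#   maxUnavailable: 0    -> never drop below the desired replica count
#   minReadySeconds: 10  -> a new pod must stay Ready 10s before it counts
apply_gradual_rollout() {
  kubectl patch deployment "$1" -n "$2" --type=merge -p '{
    "spec": {
      "minReadySeconds": 10,
      "strategy": {
        "type": "RollingUpdate",
        "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0}
      }
    }
  }'
}
```

Usage: apply_gradual_rollout myapp test-staging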

4. Health Checks

# In docker-compose.yml (Compose syntax; Kubernetes uses readiness/liveness probes instead)
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
  interval: 30s
  timeout: 10s
  retries: 3
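The healthcheck above is Compose syntax; Kubernetes expresses the same check as a container probe. A sketch that adds a readinessProbe with the same path, port and timings through a strategic merge patch (kubectl patch's default type), which matches the container by name. add_readiness_probe is hypothetical and port 3000 is taken from the Compose example. Because this changes the pod template, it starts a new rollout. A readiness probe removes an unready pod from Service endpoints and holds up a rollout; restarting a hung container would need a livenessProbe as well.

```shell
# add_readiness_probe DEPLOYMENT CONTAINER NAMESPACE  (hypothetical helper)
add_readiness_probe() {
  local patch
  # Same check as the Compose healthcheck: GET /health on :3000,
  # every 30s, 10s timeout, 3 consecutive failures -> not ready
  patch=$(printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","readinessProbe":{"httpGet":{"path":"/health","port":3000},"periodSeconds":30,"timeoutSeconds":10,"failureThreshold":3}}]}}}}' "$2")
  kubectl patch deployment "$1" -n "$3" -p "$patch"
}
```

Usage: add_readiness_probe myapp myapp test-staging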

Emergency Procedures

Complete Service Failure

Switch to maintenance mode

# Deploy maintenance page
kubectl apply -f emergency/maintenance.yaml -n test-staging

Restore from backup

# Use last known good configuration
kubectl apply -f backups/myapp-20240115.yaml -n test-staging
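Rather than typing a dated file name during an outage, the newest backup can be picked automatically. This assumes backups follow a single naming scheme, such as the date-based names above, so that lexical order is time order; latest_backup is a hypothetical helper. Since backups are taken before each deployment, the newest one normally reflects the state before the most recent deployment.

```shell
# latest_backup DEPLOYMENT [DIR]  (hypothetical helper)
latest_backup() {
  local deploy="$1" dir="${2:-backups}" newest
  newest=$(printf '%s\n' "${dir}/${deploy}"-[0-9]*.yaml | LC_ALL=C sort | tail -n 1)
  # With no matches the glob stays literal, so check the file really exists
  [ -f "$newest" ] || { echo "no backups for ${deploy} in ${dir}" >&2; return 1; }
  echo "$newest"
}
```

Usage: kubectl apply -f "$(latest_backup myapp)" -n test-staging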

Verify restoration

kubectl rollout status deployment/myapp -n test-staging
kubectl get pods -n test-staging

Database Corruption

Stop application

kubectl scale deployment/myapp --replicas=0 -n test-staging

Restore database

# Connect to database pod
kubectl exec -it postgres-0 -n test-staging -- /bin/bash

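# Restore from backup (a plain .sql dump is loaded with psql; pg_restore only
# reads pg_dump's custom, directory, or tar archives)
psql -d myapp -f /backups/myapp-20240115.sql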
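pg_restore accepts only pg_dump's archive formats (custom, directory, tar), while a plain SQL dump has to be run through psql. When the format of a backup is not known in advance, it can be detected: custom-format archives start with the bytes PGDMP. A minimal sketch to run inside the database pod; restore_db_dump is hypothetical, and --clean --if-exists (drop existing objects before recreating them) is an assumption about how the damaged database should be treated. A plain dump created without --clean is best restored into a freshly created, empty database.

```shell
# restore_db_dump DATABASE DUMP_PATH  (hypothetical helper)
#   directory (pg_dump -Fd), custom (-Fc, starts with "PGDMP"),
#   tar (-Ft, *.tar)  -> pg_restore
#   anything else, assumed plain SQL -> psql
restore_db_dump() {
  local db="$1" dump="$2"
  if [ -d "$dump" ] || [ "$(head -c 5 "$dump" 2>/dev/null)" = "PGDMP" ] \
       || [ "${dump%.tar}" != "$dump" ]; then
    pg_restore --clean --if-exists -d "$db" "$dump"
  else
    # Stop at the first SQL error instead of continuing half-restored
    psql -v ON_ERROR_STOP=1 -d "$db" -f "$dump"
  fi
}
```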

Restart application

kubectl scale deployment/myapp --replicas=3 -n test-staging
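Scaling back up to a hard-coded 3 can silently change the replica count, and restoring while old pods are still terminating risks writes during the restore. A minimal sketch that records the count before stopping, waits until the pods are gone, and restores the recorded count afterwards; stop_app, start_app, the app=<deployment> label and the .replicas-<deployment> file in the current directory are assumptions.

```shell
# stop_app / start_app DEPLOYMENT NAMESPACE  (hypothetical helpers)
stop_app() {
  local deploy="$1" ns="$2" pods i
  # Remember the current replica count for start_app
  kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}' \
    > ".replicas-${deploy}" || return 1
  kubectl scale "deployment/${deploy}" --replicas=0 -n "$ns" || return 1
  # Poll (up to about 2 minutes) until no app=<deployment> pods remain
  for ((i = 0; i < 60; i++)); do
    if pods=$(kubectl get pods -l "app=${deploy}" -n "$ns" --no-headers 2>/dev/null) \
         && [ -z "$pods" ]; then
      return 0
    fi
    sleep 2
  done
  echo "pods for ${deploy} are still running" >&2
  return 1
}

start_app() {
  local deploy="$1" ns="$2" replicas
  replicas=$(cat ".replicas-${deploy}") || return 1
  kubectl scale "deployment/${deploy}" --replicas="$replicas" -n "$ns" || return 1
  kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=300s
}
```

Usage: stop_app myapp test-staging in place of the scale-to-zero step, then start_app myapp test-staging once the restore has finished.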

Post-Rollback Actions

Document the Incident
- What went wrong
- Rollback steps taken
- Time to recovery
- Root cause

Update Monitoring
- Add alerts for the issue
- Update health checks
- Review dashboards

Fix Forward
- Create fix branch
- Test thoroughly
- Deploy with confidence

Update Runbooks
- Document new procedures
- Update emergency contacts
- Review rollback process

Rollback Commands Reference

# Basic rollback
kubectl rollout undo deployment/myapp -n test-staging

# Specific revision
kubectl rollout undo deployment/myapp --to-revision=5 -n test-staging

# Update image
kubectl set image deployment/myapp myapp=ghcr.io/dine-together/myapp:<tag> -n test-staging

# Scale to zero (stop)
kubectl scale deployment/myapp --replicas=0 -n test-staging

# Scale back up
kubectl scale deployment/myapp --replicas=3 -n test-staging

# Delete and recreate (the service is down until the new pods are ready)
kubectl delete deployment myapp -n test-staging
kubectl apply -f myapp-deployment.yaml -n test-staging

# Emergency restart (recreates pods from the current spec; this is not a rollback)
kubectl rollout restart deployment/myapp -n test-staging