Monitoring

Overview

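Monitoring shows whether your Kubernetes applications are healthy, how many resources they use, and why they fail when they do. This guide covers monitoring for the DineTogether infrastructure, from quick kubectl status checks to logs, health checks, and performance debugging. Examples use the test-staging namespace and an app named myapp.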

Quick Status Checks

Application Health

# Check pod status
kubectl get pods -n test-staging

# Check resource usage
kubectl top pods -n test-staging

# View recent events
kubectl get events -n test-staging --sort-by='.lastTimestamp'
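The pod list gets hard to scan as the namespace grows. A minimal sketch of a filter that keeps only pods needing attention follows. The function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and
# print NAME, READY and STATUS for pods that need attention.
# A pod is flagged when STATUS is neither Running nor Completed, or when it
# is Running but not all of its containers are ready (READY shows e.g. 1/2).
unhealthy_pods() {
  awk '{
    if ($3 == "Completed") { next }            # finished jobs are fine
    split($2, ready, "/")                      # READY column, e.g. "1/2"
    if ($3 != "Running" || ready[1] != ready[2]) {
      print $1, $2, $3
    }
  }'
}

# Usage (assumes kubectl access to the namespace):
#   kubectl get pods -n test-staging --no-headers | unhealthy_pods
```

An empty result means every pod is either Running with all containers ready, or Completed.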

Real-time Monitoring

# Watch pods
kubectl get pods -n test-staging -w

# Follow logs
kubectl logs -f deployment/myapp -n test-staging

# Monitor multiple pods
kubectl logs -f -l app=myapp -n test-staging --all-containers

Built-in Monitoring

K3s Metrics Server

K3s ships with metrics-server, which provides the CPU and memory figures behind kubectl top:

# Pod metrics
kubectl top pods -n test-staging

# Node metrics
kubectl top nodes

# Sort by CPU
kubectl top pods -n test-staging --sort-by=cpu

# Sort by memory
kubectl top pods -n test-staging --sort-by=memory
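Sorting shows the heaviest pods, but not whether any of them crossed a threshold. One possible sketch is shown below. The function name pods_over_memory is hypothetical, and it assumes the three-column output of kubectl top pods --no-headers (NAME, CPU, MEMORY) with memory reported in Ki, Mi, or Gi.

```shell
#!/bin/sh
# pods_over_memory LIMIT_MI: read `kubectl top pods --no-headers` output on
# stdin and print pods whose memory usage exceeds LIMIT_MI mebibytes.
pods_over_memory() {
  awk -v limit="$1" '
    # Convert a Kubernetes quantity such as 340Mi, 2Gi or 512Ki to MiB.
    function to_mi(q,    n) {
      n = q + 0                        # leading numeric part of the quantity
      if (q ~ /Gi$/) return n * 1024
      if (q ~ /Ki$/) return n / 1024
      return n                         # assume Mi when no other suffix matches
    }
    to_mi($3) > limit + 0 { printf "%s %dMi\n", $1, to_mi($3) }
  '
}

# Usage (assumes kubectl access to the namespace):
#   kubectl top pods -n test-staging --no-headers | pods_over_memory 400
```

Picking a threshold slightly below the container memory limit gives some warning before a pod is OOM-killed.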

Application Logs

# View logs
kubectl logs deployment/myapp -n test-staging

# Last 100 lines
kubectl logs deployment/myapp -n test-staging --tail=100

# Last hour (--since takes a relative duration such as 1h; use --since-time for an RFC3339 timestamp)
kubectl logs deployment/myapp -n test-staging --since=1h

# Export logs
kubectl logs deployment/myapp -n test-staging > myapp.log

Health Checks

Configure Health Endpoints

In your application:

// Next.js /pages/api/health.js
export default function handler(req, res) {
  res.status(200).json({ 
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime()
  });
}

# Django /health/
import time
from datetime import datetime

from django.http import JsonResponse

# Recorded at import time, so uptime is measured from process start
start_time = time.time()

def health_check(request):
    return JsonResponse({
        'status': 'healthy',
        'timestamp': datetime.now().isoformat(),
        'uptime': time.time() - start_time
    })
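Both handlers above return a JSON body with a status field, so a script can check the reported status as well as the HTTP code. A rough sketch without a JSON parser is shown below. The function name health_status is hypothetical, and jq -r .status would be the more robust choice where jq is installed.

```shell
#!/bin/sh
# health_status: read a health-check JSON body on stdin and print the value
# of its "status" field. Whitespace around the colon is tolerated because
# frameworks differ in whether they emit it.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (the URL matches the docker-compose healthcheck in this guide):
#   [ "$(curl -fs http://localhost:3000/api/health | health_status)" = "healthy" ] \
#     || echo "Health endpoint did not report healthy"
```

The function prints nothing when the body has no status field, so an empty result should be treated as a failure.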

Add to docker-compose.yml

services:
  web:
    healthcheck:
      # curl must be installed in the container image for this test to work
      test: ["CMD", "curl", "-f", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

Monitoring Commands

System Overview

# All resources in namespace
kubectl get all -n test-staging

# Detailed view
kubectl get pods,svc,ingress,pvc -n test-staging -o wide

# Resource usage summary
kubectl top pods -n test-staging --sum

Troubleshooting

# Pod not starting
kubectl describe pod <pod-name> -n test-staging

# Service endpoints
kubectl get endpoints -n test-staging

# Ingress status
kubectl describe ingress -n test-staging

# Certificate status
kubectl get certificates -n test-staging

Log Management

Search Logs

# Search for errors
kubectl logs deployment/myapp -n test-staging | grep -i error

# Search with context
kubectl logs deployment/myapp -n test-staging | grep -B5 -A5 "error"

# Multiple keywords
kubectl logs deployment/myapp -n test-staging | grep -E "error|exception|failed"

Log Aggregation

# All pods of a deployment
kubectl logs -l app=myapp -n test-staging --all-containers

# Save logs from all pods (one file per pod)
for pod in $(kubectl get pods -n test-staging -l app=myapp -o name); do
  kubectl logs "$pod" -n test-staging > "${pod##*/}.log"
done

Performance Monitoring

Response Time Testing

# Create curl-format.txt (needed once, before running the test)
cat > curl-format.txt << 'EOF'
time_namelookup:  %{time_namelookup}s\n
time_connect:  %{time_connect}s\n
time_appconnect:  %{time_appconnect}s\n
time_pretransfer:  %{time_pretransfer}s\n
time_redirect:  %{time_redirect}s\n
time_starttransfer:  %{time_starttransfer}s\n
----------\n
time_total:  %{time_total}s\n
EOF

# Basic test (reads the format file created above)
curl -w "@curl-format.txt" -o /dev/null -s https://myapp.test.dinetogether.co.uk

Load Testing

# Simple load test
for i in {1..100}; do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://myapp.test.dinetogether.co.uk &
done
wait
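The loop above prints one "status time" line per request, which is hard to read at 100 lines. A small sketch that condenses that output into one summary line follows. The function name summarize_load is hypothetical, and it assumes the exact "%{http_code} %{time_total}s" format used in the loop.

```shell
#!/bin/sh
# summarize_load: read "<http_code> <seconds>s" lines on stdin and print the
# request count, the number of 2xx responses, and the average and maximum
# response time.
summarize_load() {
  awk '
    {
      t = $2; sub(/s$/, "", t); t += 0   # "0.123s" -> 0.123
      n++; total += t
      if (t > max) max = t
      if ($1 ~ /^2/) ok++
    }
    END {
      if (n == 0) { print "requests=0"; exit }
      printf "requests=%d ok=%d avg=%.3fs max=%.3fs\n", n, ok, total / n, max
    }
  '
}

# Usage: group the load-test loop and pipe its combined output in (bash):
#   { for i in {1..100}; do curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" "$URL" & done; wait; } | summarize_load
```

A gap between requests and ok points at errors under load. A max far above avg suggests a few slow outliers rather than uniform slowness.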

Alerts and Notifications

Manual Checks

# Check if app is up
if ! curl -f -s https://myapp.test.dinetogether.co.uk/health > /dev/null; then
  echo "Application is down!"
  # Send notification
fi

# List pods where any container has restarted more than 5 times
kubectl get pods -n test-staging -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .restartCount > 5)) | .metadata.name'
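A single failed request does not always mean the application is down. One way to reduce false alarms is to retry the check a few times before notifying anyone, as in the sketch below. The helper name retry and its argument order are hypothetical.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries. Returns 0 on the first
# success and 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep only if another attempt is coming
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}

# Usage with the check from this section:
#   retry 3 10 curl -f -s -o /dev/null https://myapp.test.dinetogether.co.uk/health \
#     || echo "Application is down!"
```

This mirrors the retries setting of the docker-compose healthcheck: the alert fires only after several consecutive failures.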

Automated Monitoring Script

#!/bin/bash
# monitor.sh

NAMESPACE="test-staging"
APP_URL="https://myapp.test.dinetogether.co.uk/health"

# Check pods
NOT_RUNNING=$(kubectl get pods -n $NAMESPACE -o json | jq '.items[] | select(.status.phase != "Running") | .metadata.name' | wc -l)

if [ $NOT_RUNNING -gt 0 ]; then
  echo "WARNING: $NOT_RUNNING pods not running"
  kubectl get pods -n $NAMESPACE | grep -v Running
fi

# Check endpoint
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" $APP_URL)

if [ $HTTP_CODE -ne 200 ]; then
  echo "ERROR: Health check failed with code $HTTP_CODE"
fi

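# List certificates whose renewal time has already passed
# (renewalTime is an RFC3339 string, so convert it before comparing with now)
kubectl get certificates -n $NAMESPACE -o json | jq -r '.items[] | select((.status.renewalTime // empty | fromdateiso8601) < now) | .metadata.name'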

Dashboard Options

K9s (Recommended CLI Tool)

# Install k9s
brew install k9s  # macOS
# or download from https://github.com/derailed/k9s

# Run k9s
k9s -n test-staging

# Navigation:
# :pods - View pods
# :svc - View services
# l (on a selected pod) - View logs
# :events - View events

Simple Web Dashboard

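# Port forward to the metrics-server Service (443 is the Service port)
kubectl port-forward -n kube-system svc/metrics-server 8080:443

# Access metrics endpoint (may require a bearer token; -k skips TLS verification)
curl -k https://localhost:8080/metrics

# Alternative: query the metrics API through the API server
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/test-staging/pods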

Best Practices

1. Implement Health Checks

# Comprehensive health checks
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5

2. Use Structured Logging

// Use JSON logs
console.log(JSON.stringify({
  level: 'info',
  message: 'Request processed',
  duration: responseTime,
  statusCode: res.statusCode,
  timestamp: new Date().toISOString()
}));
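JSON logs can be summarised with standard tools. A minimal sketch that counts log lines per level follows. The function name count_log_levels is hypothetical, and it assumes compact JSON with no spaces around the colon, which is what JSON.stringify emits by default.

```shell
#!/bin/sh
# count_log_levels: read JSON log lines on stdin and print how many lines
# were logged at each level, most frequent first. Lines without a "level"
# field (for example plain-text output) are ignored.
count_log_levels() {
  grep -o '"level":"[^"]*"' \
    | cut -d'"' -f4 \
    | sort | uniq -c | sort -rn \
    | awk '{ print $2 "=" $1 }'
}

# Usage (assumes kubectl access to the namespace):
#   kubectl logs deployment/myapp -n test-staging --since=1h | count_log_levels
```

Comparing the error count against the total over a fixed window gives a rough error rate, one of the key metrics listed below.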

3. Monitor Key Metrics

Response time
Error rate
Request volume
Resource usage
Pod restarts

4. Set Resource Limits

resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Debugging Performance Issues

High Memory Usage

# Find memory-hungry pods
kubectl top pods -n test-staging --sort-by=memory

# Check for memory leaks
kubectl exec -it <pod-name> -n test-staging -- /bin/sh
# Inside pod: top, ps aux

High CPU Usage

# Find CPU-intensive pods
kubectl top pods -n test-staging --sort-by=cpu

# Profile application
kubectl exec -it <pod-name> -n test-staging -- /bin/sh
# Use language-specific profilers

Slow Response Times

# Check pod logs for slow queries
kubectl logs deployment/myapp -n test-staging | grep -E "took [0-9]{4,}ms"

# Test internal service response (run in the same namespace so the service name resolves)
kubectl run test --rm -it --restart=Never --image=busybox -n test-staging -- wget -O- -T5 http://myapp
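The grep above finds slow entries but does not rank them. One possible sketch that prints the slowest entries first is shown below. The function name slowest_requests is hypothetical, and the "took <n>ms" log format is inferred from the grep pattern above rather than from the application itself.

```shell
#!/bin/sh
# slowest_requests N: read log lines on stdin and print the N lines with the
# largest "took <n>ms" duration, slowest first (N defaults to 10).
slowest_requests() {
  awk 'match($0, /took [0-9]+ms/) {
         # RSTART+5 skips "took "; RLENGTH-7 drops "took " and "ms"
         print substr($0, RSTART + 5, RLENGTH - 7) "\t" $0
       }' \
    | sort -rn \
    | head -n "${1:-10}" \
    | cut -f2-
}

# Usage (assumes kubectl access to the namespace):
#   kubectl logs deployment/myapp -n test-staging | slowest_requests 5
```

The duration is copied to the front of each line for sorting and removed again by cut, so the output is the original log lines.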

Maintenance Tasks

Log Rotation

# Export and archive logs
DATE=$(date +%Y%m%d)
mkdir -p logs
kubectl logs deployment/myapp -n test-staging > "logs/myapp-$DATE.log"
gzip "logs/myapp-$DATE.log"

# Clean old logs
find logs -name "*.log.gz" -mtime +30 -delete

Resource Cleanup

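# List candidates for cleanup (review before deleting anything)
kubectl get pvc -n test-staging | grep -v Bound              # PVCs that are not Bound
kubectl get secrets -n test-staging | grep -v ghcr-secret    # secrets other than ghcr-secret

# Delete failed pods (this includes evicted pods)
kubectl delete pods -n test-staging --field-selector=status.phase=Failed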

Next Steps

Set up alerts for critical issues
Create dashboards for key metrics
Implement log aggregation for better search
Add APM (Application Performance Monitoring)
Configure backups for persistent data

Documentation Monitoring

Check Documentation Status

# Check if docs are running
kubectl get pods -n test-staging | grep docs

# View documentation logs
kubectl logs deployment/docs -n test-staging

# Check ingress
kubectl get ingress -n test-staging | grep docs

Monitor Documentation Access

# View access logs (who's reading docs)
kubectl logs deployment/docs -n test-staging | grep "GET /" | tail -20

# Check authentication attempts
kubectl logs deployment/docs -n test-staging | grep "401" | tail -10

# Monitor resource usage
kubectl top pod -l app=docs -n test-staging