Monitoring
Overview
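Monitoring keeps the applications running on the DineTogether Kubernetes cluster reliable and performant. This guide covers quick status checks, logs, health checks, basic alerting, and routine maintenance, using the test-staging namespace and a deployment named myapp as the running example.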
Quick Status Checks
Application Health
# Check pod status
kubectl get pods -n test-staging
# Check resource usage
kubectl top pods -n test-staging
# View recent events
kubectl get events -n test-staging --sort-by='.lastTimestamp'
Real-time Monitoring
# Watch pods
kubectl get pods -n test-staging -w
# Follow logs
kubectl logs -f deployment/myapp -n test-staging
# Monitor multiple pods
kubectl logs -f -l app=myapp -n test-staging --all-containers
Built-in Monitoring
K3s Metrics Server
K3s includes a metrics server for basic resource monitoring:
# Pod metrics
kubectl top pods -n test-staging
# Node metrics
kubectl top nodes
# Sort by CPU
kubectl top pods -n test-staging --sort-by=cpu
# Sort by memory
kubectl top pods -n test-staging --sort-by=memory
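Sorting shows the heaviest pods first, but a threshold filter is often easier to script. The following is a minimal sketch, not part of the original setup: the helper name `pods_over_memory` is hypothetical, and it assumes the memory column of `kubectl top pods` is reported in Mi (rows in any other unit are skipped).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl top pods --no-headers` output on stdin
# and prints the name and memory of pods using more than the given number of Mi.
# Assumption: column 3 looks like "256Mi"; other units are ignored.
pods_over_memory() {
  local threshold_mi="$1"
  awk -v limit="$threshold_mi" '
    $3 ~ /^[0-9]+Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)          # strip the unit, keep the number
      if (mem + 0 > limit + 0) print $1, $3
    }'
}

# Usage (requires cluster access):
#   kubectl top pods -n test-staging --no-headers | pods_over_memory 200
```

The 200 Mi threshold in the usage line is only a placeholder; pick a value that matches your own resource limits.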
Application Logs
# View logs
kubectl logs deployment/myapp -n test-staging
# Last 100 lines
kubectl logs deployment/myapp -n test-staging --tail=100
# Last hour (use --since-time for an RFC3339 timestamp)
kubectl logs deployment/myapp -n test-staging --since=1h
# Export logs
kubectl logs deployment/myapp -n test-staging > myapp.log
Health Checks
Configure Health Endpoints
In your application:
// Next.js /pages/api/health.js
export default function handler(req, res) {
  res.status(200).json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime()
  });
}
# Django /health/
import time
from datetime import datetime

from django.http import JsonResponse

# Recorded when the module is imported; used to report uptime
start_time = time.time()

def health_check(request):
    return JsonResponse({
        'status': 'healthy',
        'timestamp': datetime.now().isoformat(),
        'uptime': time.time() - start_time
    })
Add to docker-compose.yml
services:
  web:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
Monitoring Commands
System Overview
# All resources in namespace
kubectl get all -n test-staging
# Detailed view
kubectl get pods,svc,ingress,pvc -n test-staging -o wide
# Resource usage summary
kubectl top pods -n test-staging --sum
Troubleshooting
# Pod not starting
kubectl describe pod <pod-name> -n test-staging
# Service endpoints
kubectl get endpoints -n test-staging
# Ingress status
kubectl describe ingress -n test-staging
# Certificate status
kubectl get certificates -n test-staging
Log Management
Search Logs
# Search for errors
kubectl logs deployment/myapp -n test-staging | grep -i error
# Search with context
kubectl logs deployment/myapp -n test-staging | grep -B5 -A5 "error"
# Multiple keywords
kubectl logs deployment/myapp -n test-staging | grep -E "error|exception|failed"
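When the raw grep output is too long to read, a per-keyword count gives a quicker picture. This is a rough sketch under assumptions: the helper name `count_log_keywords` is hypothetical, and the keyword list simply mirrors the grep pattern above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads log text on stdin and prints, for each keyword,
# how many lines contain it (case-insensitive).
count_log_keywords() {
  local input keyword count
  input=$(cat)
  for keyword in error exception failed; do
    # grep -c exits non-zero when the count is 0, so ignore its exit status
    count=$(printf '%s\n' "$input" | grep -ci -- "$keyword" || true)
    printf '%s: %s\n' "$keyword" "$count"
  done
}

# Usage (requires cluster access):
#   kubectl logs deployment/myapp -n test-staging | count_log_keywords
```

A line that contains several keywords is counted once for each of them, so the totals can add up to more than the number of lines.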
Log Aggregation
# All pods of a deployment (with a selector, kubectl shows only the last 10 lines per pod unless --tail is set)
kubectl logs -l app=myapp -n test-staging --all-containers --tail=-1
# Save logs from all pods
for pod in $(kubectl get pods -n test-staging -l app=myapp -o name); do
  kubectl logs "$pod" -n test-staging > "${pod##*/}.log"
done
Performance Monitoring
Response Time Testing
# Create curl-format.txt (needed before running the test below)
cat > curl-format.txt << 'EOF'
time_namelookup: %{time_namelookup}s\n
time_connect: %{time_connect}s\n
time_appconnect: %{time_appconnect}s\n
time_pretransfer: %{time_pretransfer}s\n
time_redirect: %{time_redirect}s\n
time_starttransfer: %{time_starttransfer}s\n
----------\n
time_total: %{time_total}s\n
EOF
# Basic test
curl -w "@curl-format.txt" -o /dev/null -s https://myapp.test.dinetogether.co.uk
Load Testing
# Simple load test
for i in {1..100}; do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://myapp.test.dinetogether.co.uk &
done
wait
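The loop above prints one `<status> <time>s` line per request, which is hard to read at 100 lines. The sketch below condenses that output into a single summary. The helper name `summarize_load_test` is hypothetical, and it assumes every input line has exactly the two fields produced by the `-w` format used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads lines like "200 0.123s" on stdin and prints the
# request count, the number of non-200 responses, and the mean total time.
summarize_load_test() {
  awk '
    {
      total++
      if ($1 != "200") failed++
      t = $2
      sub(/s$/, "", t)             # drop the trailing "s" unit
      sum += t
    }
    END {
      if (total == 0) { print "no results"; exit }
      printf "requests=%d non_200=%d avg_time=%.3fs\n", total, failed, sum / total
    }'
}

# Usage (sends real traffic to the URL):
#   { for i in {1..100}; do
#       curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://myapp.test.dinetogether.co.uk &
#     done; wait; } | summarize_load_test
```

A mean hides outliers, so keep the raw lines around if you need to investigate individual slow requests.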
Alerts and Notifications
Manual Checks
# Check if app is up
if ! curl -f -s https://myapp.test.dinetogether.co.uk/health > /dev/null; then
  echo "Application is down!"
  # Send notification
fi
# List pods where any container has restarted more than 5 times
kubectl get pods -n test-staging -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .restartCount > 5)) | .metadata.name'
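On machines where jq is not installed, the restart check can be approximated from the plain table output. This is a sketch under assumptions: the helper name `pods_restarting` is hypothetical, and it relies on the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE), where the restart count is the fourth field.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints pods whose RESTARTS column exceeds the given limit.
# Assumption: default column layout, restart count in field 4.
pods_restarting() {
  local limit="$1"
  awk -v limit="$limit" '$4 + 0 > limit + 0 { print $1, $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n test-staging --no-headers | pods_restarting 5
```

Unlike the per-container jq query, this uses the pod-level total that kubectl prints, so the two checks can disagree for multi-container pods.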
Automated Monitoring Script
#!/bin/bash
# monitor.sh
NAMESPACE="test-staging"
APP_URL="https://myapp.test.dinetogether.co.uk/health"
# Check pods (completed pods in phase Succeeded are not counted)
NOT_RUNNING=$(kubectl get pods -n "$NAMESPACE" -o json | jq '[.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded")] | length')
if [ "$NOT_RUNNING" -gt 0 ]; then
  echo "WARNING: $NOT_RUNNING pods not running"
  kubectl get pods -n "$NAMESPACE" | grep -Ev 'Running|Completed'
fi
# Check endpoint
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$APP_URL")
if [ "$HTTP_CODE" -ne 200 ]; then
  echo "ERROR: Health check failed with code $HTTP_CODE"
fi
# Check for certificates that are past their renewal time
kubectl get certificates -n "$NAMESPACE" -o json | jq -r '.items[] | select(.status.renewalTime != null and (.status.renewalTime | fromdateiso8601) < now) | .metadata.name'
Dashboard Options
K9s (Recommended CLI Tool)
# Install k9s
brew install k9s # macOS
# or download from https://github.com/derailed/k9s
# Run k9s
k9s -n test-staging
# Navigation:
# :pods - View pods
# :svc - View services
# l - View logs for the selected pod
# :events - View events
Simple Web Dashboard
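# Port forward to the metrics-server service (service port 443)
kubectl port-forward -n kube-system service/metrics-server 8080:443
# Access metrics endpoint (may return 401/403 without a bearer token)
curl -k https://localhost:8080/metrics
# Alternative: query the metrics API through the API server
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/test-staging/pods"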
Best Practices
1. Implement Health Checks
# Comprehensive health checks
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
2. Use Structured Logging
// Use JSON logs
console.log(JSON.stringify({
  level: 'info',
  message: 'Request processed',
  duration: responseTime,
  statusCode: res.statusCode,
  timestamp: new Date().toISOString()
}));
3. Monitor Key Metrics
- Response time
- Error rate
- Request volume
- Resource usage
- Pod restarts
4. Set Resource Limits
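A pod without CPU or memory limits can starve its neighbours, so a useful first step is to find out which pods lack them. The following is a minimal sketch under assumptions: the helper name `pods_without_limits` is hypothetical, it expects three whitespace-separated columns (name, CPU limit, memory limit), and it relies on kubectl printing `<none>` for a missing custom-column value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "NAME CPU_LIMIT MEM_LIMIT" rows on stdin and
# prints the names of pods where either limit is missing ("<none>").
pods_without_limits() {
  awk '$2 == "<none>" || $3 == "<none>" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n test-staging --no-headers \
#     -o custom-columns='NAME:.metadata.name,CPU:.spec.containers[*].resources.limits.cpu,MEM:.spec.containers[*].resources.limits.memory' \
#     | pods_without_limits
#
# Limits can then be set on a deployment, for example (values are placeholders):
#   kubectl set resources deployment/myapp -n test-staging \
#     --requests=cpu=100m,memory=128Mi --limits=cpu=500m,memory=512Mi
```

For pods with several containers the columns contain comma-separated values, so a pod where only some containers lack limits may not be flagged by this simple check.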
Debugging Performance Issues
High Memory Usage
# Find memory-hungry pods
kubectl top pods -n test-staging --sort-by=memory
# Check for memory leaks
kubectl exec -it <pod-name> -n test-staging -- /bin/sh
# Inside pod: top, ps aux
High CPU Usage
# Find CPU-intensive pods
kubectl top pods -n test-staging --sort-by=cpu
# Profile application
kubectl exec -it <pod-name> -n test-staging -- /bin/sh
# Use language-specific profilers
Slow Response Times
# Check pod logs for slow queries
kubectl logs deployment/myapp -n test-staging | grep -E "took [0-9]{4,}ms"
# Test internal service response
kubectl run test --rm -it --restart=Never --image=busybox -n test-staging -- wget -O- -T5 http://myapp
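The grep for slow queries above hard-codes a threshold of four digits, that is 1000 ms or more. A variant with an adjustable threshold could look like the sketch below. The helper name `slow_requests` is hypothetical, and the `took <N>ms` log format is taken from the grep pattern above, so adapt it if your application logs durations differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads log lines on stdin and prints those containing
# "took <N>ms" where N is at least the given threshold in milliseconds.
slow_requests() {
  local threshold_ms="$1"
  awk -v limit="$threshold_ms" '
    match($0, /took [0-9]+ms/) {
      # "took " is 5 characters and "ms" is 2, so the digits sit in between
      ms = substr($0, RSTART + 5, RLENGTH - 7)
      if (ms + 0 >= limit + 0) print
    }'
}

# Usage (requires cluster access):
#   kubectl logs deployment/myapp -n test-staging | slow_requests 2000
```

Only the first `took <N>ms` occurrence on each line is considered.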
Maintenance Tasks
Log Rotation
# Export and archive logs
DATE=$(date +%Y%m%d)
mkdir -p logs
kubectl logs deployment/myapp -n test-staging > "logs/myapp-$DATE.log"
gzip "logs/myapp-$DATE.log"
# Clean old logs
find logs -name "*.log.gz" -mtime +30 -delete
Resource Cleanup
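# List PVCs that are not bound
kubectl get pvc -n test-staging | grep -v Bound
# List secrets other than ghcr-secret (review manually before deleting anything)
kubectl get secrets -n test-staging | grep -v ghcr-secret
# Delete failed pods (this includes evicted pods)
kubectl delete pods -n test-staging --field-selector=status.phase=Failed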
Next Steps
- Set up alerts for critical issues
- Create dashboards for key metrics
- Implement log aggregation for better search
- Add APM (Application Performance Monitoring)
- Configure backups for persistent data
Documentation Monitoring
Check Documentation Status
# Check if docs are running
kubectl get pods -n test-staging | grep docs
# View documentation logs
kubectl logs deployment/docs -n test-staging
# Check ingress
kubectl get ingress -n test-staging | grep docs
Monitor Documentation Access
# View access logs (who's reading docs)
kubectl logs deployment/docs -n test-staging | grep "GET /" | tail -20
# Check authentication attempts
kubectl logs deployment/docs -n test-staging | grep "401" | tail -10
# Monitor resource usage
kubectl top pod -l app=docs -n test-staging