Runbooks

Overview

Operational runbooks for common tasks and incident response.

Common Operations

Deploy Application Update

# 1. Push changes to Git
git add . && git commit -m "Update" && git push

# 2. ArgoCD syncs automatically if auto-sync is enabled; otherwise trigger a manual sync:
argocd app sync <app-name>

# 3. Monitor rollout
kubectl rollout status deployment/<name> -n <namespace>

# 4. Verify pods healthy
kubectl get pods -n <namespace>
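Steps 2 to 4 above can be chained so that a stuck rollout fails fast instead of blocking the terminal. This is a minimal sketch, assuming the `argocd` CLI is already logged in; the function name `deploy_and_verify` and the 300-second timeout are illustrative choices, not part of the original procedure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sync an ArgoCD app, then block until the rollout finishes.
# Arguments: <app-name> <deployment-name> <namespace>
deploy_and_verify() {
  local app="$1" deploy="$2" ns="$3"

  # Trigger a manual sync in case auto-sync has not picked up the commit yet.
  argocd app sync "$app" || return 1

  # Give up after 5 minutes (assumed value) rather than waiting forever.
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=300s || return 1

  # List the resulting pods for a final visual check.
  kubectl get pods -n "$ns"
}
```

Usage: `deploy_and_verify <app-name> <name> <namespace>` after pushing to Git.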

Scale Deployment

# Scale replicas
kubectl scale deployment/<name> -n <namespace> --replicas=3

# Verify scaling
kubectl get pods -n <namespace> -w
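Watching pods with `-w` requires someone to decide when scaling is done. As an alternative, the sketch below records the current replica count, scales, and then waits for the Deployment to settle. The name `scale_and_wait` and the 180-second timeout are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scale a Deployment and wait until the change has rolled out.
# Arguments: <deployment-name> <namespace> <replicas>
scale_and_wait() {
  local deploy="$1" ns="$2" replicas="$3"
  local current

  # Read the current desired replica count so the change is visible in the output.
  current=$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}') || return 1
  echo "Scaling $deploy in $ns: $current -> $replicas replicas"

  kubectl scale "deployment/$deploy" -n "$ns" --replicas="$replicas" || return 1

  # Block until the new replica count is available (timeout is an assumed value).
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=180s
}
```

If ArgoCD manages the Deployment with self-heal enabled, a manual scale may be reverted on the next sync, so a lasting change likely belongs in Git.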

Restart Application

# Rolling restart
kubectl rollout restart deployment/<name> -n <namespace>

# Monitor restart
kubectl rollout status deployment/<name> -n <namespace>

Incident Response

Pod CrashLooping

Symptoms: Pod repeatedly restarts

Steps:

  1. Check pod status

    kubectl describe pod <pod-name> -n <namespace>
    

  2. View logs

    kubectl logs <pod-name> -n <namespace> --previous
    

  3. Check resource usage against the pod's limits

    kubectl top pod <pod-name> -n <namespace>
    

  4. Common causes:

     - OOMKilled: Increase memory limit
     - Config error: Check configmaps/secrets
     - Dependency unavailable: Check network/services
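To tell the causes above apart quickly, the last termination reason can be read straight from the pod status rather than scanning the `describe` output. The sketch below is one way to do that; the function names and the reason-to-action mapping are illustrative assumptions, not an exhaustive diagnosis.

```shell
#!/usr/bin/env bash
# Print "<container> <reason> <exitCode>" for each container's last termination.
# Arguments: <pod-name> <namespace>
last_termination() {
  local pod="$1" ns="$2"
  local tpl='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.reason}{" "}{.lastState.terminated.exitCode}{"\n"}{end}'
  kubectl get pod "$pod" -n "$ns" -o jsonpath="$tpl"
}

# Map a termination reason to a next step (assumed mapping, extend as needed).
suggest_cause() {
  case "$1" in
    OOMKilled) echo "Increase the memory limit" ;;
    Error)     echo "Check logs, configmaps/secrets, and dependencies" ;;
    *)         echo "Inspect events in 'kubectl describe pod'" ;;
  esac
}
```

Usage: run `last_termination <pod-name> <namespace>`, then pass the reason column to `suggest_cause`.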

Node Not Ready

Symptoms: Node shows NotReady status

Steps:

  1. Check node status

    kubectl describe node <node-name>
    

  2. Check kubelet (on k3s it runs inside the k3s service; use k3s instead of k3s-agent on server nodes)

    ssh <node> systemctl status k3s-agent
    

  3. Check disk pressure

    ssh <node> df -h
    

  4. Check memory

    ssh <node> free -m
    

  5. Remediation:

     - Restart kubelet: systemctl restart k3s-agent
     - Clear disk space
     - Drain and reboot node
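Before logging in to the node, its reported conditions usually show whether disk, memory, or the kubelet itself is the problem. The sketch below lists the conditions and filters out the healthy ones; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print one "Type=Status" line per node condition.
# Argument: <node-name>
node_conditions() {
  kubectl get node "$1" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Read "Type=Status" lines on stdin and print only the unhealthy ones.
# Ready must be True; every other condition (DiskPressure, MemoryPressure, ...)
# must be False.
unhealthy_conditions() {
  awk -F= 'NF == 2 && (($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False"))'
}
```

Usage: `node_conditions <node-name> | unhealthy_conditions`. A line such as `DiskPressure=True` points to the disk check, while `Ready=Unknown` suggests the kubelet has stopped reporting.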

Database Connection Failure

Symptoms: Applications can't connect to database

Steps:

  1. Check database pod

    kubectl get pods -n databases
    

  2. Test connectivity

    kubectl exec <app-pod> -n <namespace> -- nc -zv mongodb.databases.svc 27017
    

  3. Check service

    kubectl get svc -n databases
    kubectl get endpoints -n databases
    

  4. Check database logs

    kubectl logs -n databases deployment/mongodb
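A Service with no ready endpoints is a common reason for connection failures even when the database pod appears in the pod list. The sketch below makes that check explicit; the function name `service_has_endpoints` is hypothetical, and the Service name `mongodb` in the usage line is inferred from the hostname used in step 2.

```shell
#!/usr/bin/env bash
# Succeed only if the Service has at least one ready backend address.
# Arguments: <service-name> <namespace>
service_has_endpoints() {
  local svc="$1" ns="$2" ips

  # Collect the IPs of all ready addresses behind the Service.
  ips=$(kubectl get endpoints "$svc" -n "$ns" -o jsonpath='{.subsets[*].addresses[*].ip}') || return 1

  if [ -z "$ips" ]; then
    echo "No ready endpoints for $svc in $ns: check pod readiness and the Service selector"
    return 1
  fi
  echo "Ready endpoints for $svc: $ips"
}
```

Usage: `service_has_endpoints mongodb databases`.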
    

High CPU/Memory

Symptoms: Slow performance, alerts firing

Steps:

  1. Identify top consumers

    kubectl top pods --all-namespaces --sort-by=cpu | head
    

  2. Check node resources

    kubectl top nodes
    

  3. Scale or adjust limits

    kubectl edit deployment <name> -n <namespace>
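When the list of consumers is long, filtering by a threshold can be easier than reading a sorted table. The sketch below is a hypothetical filter over `kubectl top` output; it assumes the CPU column is reported in millicores (for example `250m`).

```shell
#!/usr/bin/env bash
# Read `kubectl top pods -A --no-headers` lines on stdin
# (columns: NAMESPACE NAME CPU MEMORY) and print the pods whose CPU usage
# is at or above the given threshold in millicores.
# Argument: <threshold-millicores>
pods_over_cpu() {
  awk -v limit="$1" '{
    cpu = $3
    sub(/m$/, "", cpu)                      # strip the millicore suffix
    if (cpu + 0 >= limit + 0) print $1 "/" $2, $3
  }'
}
```

Usage: `kubectl top pods -A --no-headers | pods_over_cpu 500`. For a non-interactive alternative to `kubectl edit`, `kubectl set resources deployment/<name> -n <namespace> --limits=cpu=<value>,memory=<value>` changes limits in one command.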
    

Storage Full

Symptoms: Write errors, pod failures

Steps:

  1. Check PVC usage

    kubectl exec <pod> -n <namespace> -- df -h
    

  2. Expand volume (if supported)

    kubectl edit pvc <name> -n <namespace>
    # Increase spec.resources.requests.storage
    

  3. Clean up old data

    kubectl exec <pod> -n <namespace> -- find /data -type f -mtime +30 -delete
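Turning the `df` output into a number makes it easy to decide whether cleanup or expansion is needed. The sketch below parses POSIX-format `df -P` output; the function names and the default threshold of 85 percent are assumptions.

```shell
#!/usr/bin/env bash
# Read `df -P <path>` output on stdin and print the usage percentage
# of the first filesystem as a bare number.
usage_percent() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is at or above the threshold.
# Arguments: <used-percent> [threshold-percent, default 85]
is_nearly_full() {
  local used="$1" threshold="${2:-85}"
  [ "$used" -ge "$threshold" ]
}
```

Usage: `kubectl exec <pod> -n <namespace> -- df -P /data | usage_percent`. Before deleting anything, running the same `find` with `-print` in place of `-delete` shows which files would be removed.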
    

Maintenance Tasks

Certificate Renewal

Cloudflare manages external certificates automatically.

Update Kubernetes

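# Backup the datastore (k3s has no etcd pod; this applies to embedded etcd only.
# With the default SQLite datastore, copy /var/lib/rancher/k3s/server/db instead)
ssh master sudo k3s etcd-snapshot save --name pre-upgrade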

# Update master
ssh master
curl -sfL https://get.k3s.io | sh -

# Update workers (one at a time)
ssh worker-1
curl -sfL https://get.k3s.io | K3S_URL=https://master:6443 K3S_TOKEN=<token> sh -
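Upgrading workers one at a time is safer when each node is drained first and returned to service only after it reports Ready. The sketch below wraps that sequence for a single worker. It is an assumption-laden example: the function name is hypothetical, `INSTALL_K3S_VERSION` pins the release rather than taking the latest, and the timeouts are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drain, upgrade, and uncordon one k3s worker.
# Arguments: <node-name> <k3s-version> <join-token>
upgrade_worker() {
  local node="$1" version="$2" token="$3"

  # Move workloads off the node; DaemonSet pods stay, emptyDir data is discarded.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1

  # Re-run the installer on the node with a pinned version.
  # Note: the token is visible in the remote process list while this runs.
  ssh "$node" "curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='$version' K3S_URL=https://master:6443 K3S_TOKEN='$token' sh -" || return 1

  # Confirm the node reports Ready before allowing pods back onto it.
  kubectl wait --for=condition=Ready "node/$node" --timeout=300s || return 1
  kubectl uncordon "$node"
}
```

Usage: `upgrade_worker worker-1 <version> <token>`, repeated for each worker after the master has been upgraded.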

Rotate Secrets

# Generate new secret
kubectl create secret generic <name> \
  --from-literal=key=new-value \
  -n <namespace> \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart deployments using secret
kubectl rollout restart deployment/<name> -n <namespace>
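The two commands above can be combined with random value generation so that the new secret never has to be typed by hand. This is a sketch under assumptions: the function names are hypothetical, the secret has a single key, and a 32-byte random value suits the consuming application.

```shell
#!/usr/bin/env bash
# Generate a random 32-byte value, base64-encoded on a single line.
new_secret_value() {
  head -c 32 /dev/urandom | base64 | tr -d '\n'
}

# Replace a single-key secret and restart the Deployment that uses it.
# Arguments: <secret-name> <key> <namespace> <deployment-name>
rotate_secret() {
  local name="$1" key="$2" ns="$3" deploy="$4"

  # Render the secret client-side, then apply it so an existing secret is updated.
  kubectl create secret generic "$name" \
    --from-literal="$key=$(new_secret_value)" \
    -n "$ns" --dry-run=client -o yaml | kubectl apply -f - || return 1

  # Pods read secrets at start-up, so restart and wait for the rollout.
  kubectl rollout restart "deployment/$deploy" -n "$ns" || return 1
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=300s
}
```

Usage: `rotate_secret <name> <key> <namespace> <deployment-name>`.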

Monitoring Checks

Daily Checks

Check        Command
Node status  kubectl get nodes
Pod health   kubectl get pods -A | grep -v Running
PVC status   kubectl get pvc -A
Alerts       Check Alertmanager
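The pod and PVC checks above can be narrowed so that only problems are printed. The filters below are a sketch with hypothetical names; they assume the default column layout of `kubectl get ... -A --no-headers`, and they treat Completed pods as healthy, unlike a plain `grep -v Running`.

```shell
#!/usr/bin/env bash
# Read `kubectl get pods -A --no-headers` lines on stdin
# (columns: NAMESPACE NAME READY STATUS ...) and print pods whose
# STATUS is neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Read `kubectl get pvc -A --no-headers` lines on stdin
# (columns: NAMESPACE NAME STATUS ...) and print claims that are not Bound.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}
```

Usage: `kubectl get pods -A --no-headers | unhealthy_pods` and `kubectl get pvc -A --no-headers | unbound_pvcs`. Empty output means the check passed.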

Weekly Checks

Check                Action
Backup verification  Test restore
Log review           Check for errors
Resource trends      Review Grafana
Security updates     Check CVEs

Emergency Contacts

Role             Contact
Primary On-Call  -
Escalation       -
Infrastructure   -

Post-Incident

Template

## Incident Report

**Date**: YYYY-MM-DD
**Duration**: X hours
**Severity**: High/Medium/Low

### Summary
Brief description of what happened.

### Timeline
- HH:MM - Event detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Resolution applied
- HH:MM - Incident resolved

### Root Cause
What caused the incident.

### Resolution
How the incident was resolved.

### Action Items
- [ ] Preventive measure 1
- [ ] Preventive measure 2