Runbooks
Overview
Operational runbooks for common tasks and incident response.
Common Operations
Deploy Application Update
# 1. Push changes to Git
git add . && git commit -m "Update" && git push
# 2. Argo CD syncs automatically when auto-sync is enabled; otherwise sync manually:
argocd app sync <app-name>
# 3. Monitor rollout
kubectl rollout status deployment/<name> -n <namespace>
# 4. Verify pods healthy
kubectl get pods -n <namespace>
Scale Deployment
# Scale replicas
kubectl scale deployment/<name> -n <namespace> --replicas=3
# Verify scaling
kubectl get pods -n <namespace> -w
Restart Application
# Rolling restart
kubectl rollout restart deployment/<name> -n <namespace>
# Monitor restart
kubectl rollout status deployment/<name> -n <namespace>
Incident Response
Pod CrashLooping
Symptoms: Pod repeatedly restarts
Steps:
- Check pod status
- View logs
- Check resource limits
- Common causes:
  - OOMKilled: Increase memory limit
  - Config error: Check configmaps/secrets
  - Dependency unavailable: Check network/services
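The steps above could map to kubectl commands along these lines. This is a minimal sketch, not the original runbook's commands: the function name `diagnose_crashloop` and the `KUBECTL` override variable are illustrative assumptions.

```shell
# Hypothetical helper. Override KUBECTL (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

diagnose_crashloop() {
  local pod="$1" ns="$2"
  # 1. Pod status, restart count, and recent events
  $KUBECTL get pod "$pod" -n "$ns" -o wide
  $KUBECTL describe pod "$pod" -n "$ns"
  # 2. Logs from the previous (crashed) container instance
  $KUBECTL logs "$pod" -n "$ns" --previous --tail=100
  # 3. Configured requests and limits for each container
  $KUBECTL get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'
  # Reason for the last termination (prints OOMKilled after a memory kill)
  $KUBECTL get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}
```

Usage: `diagnose_crashloop <pod-name> <namespace>`.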
Node Not Ready
Symptoms: Node shows NotReady status
Steps:
- Check node status
- Check kubelet
- Check disk pressure
- Check memory
- Remediation:
  - Restart kubelet: systemctl restart k3s-agent
  - Clear disk space
  - Drain and reboot node
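A possible way to run these checks is sketched below. The function names and the `KUBECTL` override are illustrative assumptions; the on-node commands in the comments assume an agent node, where the kubelet runs inside the k3s-agent service.

```shell
# Hypothetical helpers. Override KUBECTL (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

diagnose_node() {
  local node="$1"
  # Node status across the cluster
  $KUBECTL get nodes -o wide
  # Conditions (Ready, DiskPressure, MemoryPressure, ...) with their reasons
  $KUBECTL get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" ("}{.reason}{")\n"}{end}'
}

# On the node itself:
#   systemctl status k3s-agent
#   journalctl -u k3s-agent --since "15 min ago"
#   df -h      # disk pressure
#   free -m    # memory

drain_node() {
  # Evict workloads before rebooting the node
  $KUBECTL drain "$1" --ignore-daemonsets --delete-emptydir-data
}

uncordon_node() {
  # Allow scheduling again once the node is back
  $KUBECTL uncordon "$1"
}
```

Usage: `diagnose_node worker-1`, then `drain_node worker-1` before the reboot and `uncordon_node worker-1` afterwards.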
Database Connection Failure
Symptoms: Applications can't connect to the database
Steps:
- Check database pod
- Test connectivity
- Check service
- Check database logs
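One way to walk through these checks is sketched here. Everything passed in as an argument (namespace, service name, port, label selector) is a placeholder, since the original does not name the database; the `busybox:1.36` test image and the `db-conn-test` pod name are also assumptions.

```shell
# Hypothetical helper. Override KUBECTL (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

diagnose_db() {
  local ns="$1" svc="$2" port="$3" selector="$4"
  # 1. Database pod status
  $KUBECTL get pods -n "$ns" -l "$selector" -o wide
  # 2. TCP connectivity from a throwaway pod inside the cluster
  $KUBECTL run db-conn-test -n "$ns" --rm -i --restart=Never \
    --image=busybox:1.36 -- nc -z -w 3 "$svc" "$port"
  # 3. Service and the endpoints behind it (no endpoints means no ready pods)
  $KUBECTL get svc "$svc" -n "$ns"
  $KUBECTL get endpoints "$svc" -n "$ns"
  # 4. Recent database logs
  $KUBECTL logs -n "$ns" -l "$selector" --tail=100
}
```

Usage, with placeholder values: `diagnose_db <namespace> <service> <port> <label-selector>`.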
High CPU/Memory
Symptoms: Slow performance, alerts firing
Steps:
- Identify top consumers
- Check node resources
- Scale or adjust limits
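These steps might look as follows. The sketch assumes metrics-server is running (`kubectl top` depends on it); the function names, the `KUBECTL` override, and any limit values passed in are illustrative.

```shell
# Hypothetical helpers. Override KUBECTL (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

top_consumers() {
  # Pods sorted by CPU, then by memory, across all namespaces
  $KUBECTL top pods -A --sort-by=cpu
  $KUBECTL top pods -A --sort-by=memory
  # Per-node usage
  $KUBECTL top nodes
}

adjust_limits() {
  local deploy="$1" ns="$2" limits="$3"
  # Changing limits triggers a rolling update of the deployment
  $KUBECTL set resources "deployment/$deploy" -n "$ns" --limits="$limits"
}
```

Usage: `top_consumers`, then either scale out as in "Scale Deployment" or run `adjust_limits <name> <namespace> cpu=500m,memory=512Mi` (example values only).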
Storage Full
Symptoms: Write errors, pod failures
Steps:
- Check PVC usage
- Expand volume (if supported)
- Clean up old data
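A rough sketch of the first two steps follows. Expansion only works when the PVC's StorageClass has allowVolumeExpansion set to true, which the `get storageclass` output shows. Function names, the `KUBECTL` override, and all argument values are assumptions.

```shell
# Hypothetical helpers. Override KUBECTL (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

check_pvc_usage() {
  local ns="$1" pod="$2" mount="$3"
  # Claimed capacity per PVC
  $KUBECTL get pvc -n "$ns"
  # Actual usage, measured inside a pod that mounts the volume
  $KUBECTL exec "$pod" -n "$ns" -- df -h "$mount"
}

expand_pvc() {
  local ns="$1" pvc="$2" size="$3"
  # Check the ALLOWVOLUMEEXPANSION column before patching
  $KUBECTL get storageclass
  # Request the larger size on the claim
  $KUBECTL patch pvc "$pvc" -n "$ns" \
    -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$size\"}}}}"
}
```

Usage: `check_pvc_usage <namespace> <pod> <mount-path>`, then `expand_pvc <namespace> <pvc-name> <new-size>` if the storage class allows it.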
Maintenance Tasks
Certificate Renewal
Cloudflare manages external certificates automatically.
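Update Kubernetes
# 1. Back up the datastore on the server node before upgrading
ssh master
# K3s embeds etcd in its own process (there is no etcd pod to exec into),
# so take the snapshot with the k3s CLI. This applies to embedded etcd only;
# with the default SQLite datastore, copy /var/lib/rancher/k3s/server/db/ instead.
sudo k3s etcd-snapshot save --name pre-upgrade
# 2. Update master (re-run the installer with the same options used at install time)
curl -sfL https://get.k3s.io | sh -
# 3. Update workers (one at a time)
ssh worker-1
curl -sfL https://get.k3s.io | K3S_URL=https://master:6443 K3S_TOKEN=<token> sh -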
Rotate Secrets
# Generate new secret
kubectl create secret generic <name> \
--from-literal=key=new-value \
-n <namespace> \
--dry-run=client -o yaml | kubectl apply -f -
# Restart deployments that use the secret so pods pick up the new value
kubectl rollout restart deployment/<name> -n <namespace>
Monitoring Checks
Daily Checks
| Check | Command |
|---|---|
| Node status | kubectl get nodes |
| Pod health | kubectl get pods -A \| grep -v Running |
| PVC status | kubectl get pvc -A |
| Alerts | Check Alertmanager |
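The command-based checks in the table could be bundled into one function, roughly as below. The name `daily_checks` and the `KUBECTL` override are illustrative; the Alertmanager check stays manual.

```shell
# Hypothetical helper. Override KUBECTL (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

daily_checks() {
  echo "== Nodes =="
  $KUBECTL get nodes
  echo "== Pods not Running =="
  # grep exits 1 when every line contains "Running", hence the "|| true"
  $KUBECTL get pods -A | grep -v Running || true
  echo "== PVCs =="
  $KUBECTL get pvc -A
}
```

Usage: `daily_checks`, then review each section of the output.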
Weekly Checks
| Check | Action |
|---|---|
| Backup verification | Test restore |
| Log review | Check for errors |
| Resource trends | Review Grafana |
| Security updates | Check CVEs |
Emergency Contacts
| Role | Contact |
|---|---|
| Primary On-Call | - |
| Escalation | - |
| Infrastructure | - |
Post-Incident
Template
## Incident Report
**Date**: YYYY-MM-DD
**Duration**: X hours
**Severity**: High/Medium/Low
### Summary
Brief description of what happened.
### Timeline
- HH:MM - Event detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Resolution applied
- HH:MM - Incident resolved
### Root Cause
What caused the incident.
### Resolution
How the incident was resolved.
### Action Items
- [ ] Preventive measure 1
- [ ] Preventive measure 2