Runbooks

Overview

Operational runbooks for common tasks and incident response.

Common Operations

Deploy Application Update

```
# 1. Push changes to Git
git add . && git commit -m "Update" && git push

# 2. ArgoCD syncs automatically if auto-sync is enabled; otherwise sync manually:
argocd app sync <app-name>

# 3. Monitor the rollout
kubectl rollout status deployment/<name> -n <namespace>

# 4. Verify that the pods are healthy
kubectl get pods -n <namespace>
```
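The sync, rollout, and verification steps above can be wrapped in one function so that a stalled rollout fails loudly instead of hanging. This is a minimal sketch, not the project's tooling: `deploy_and_verify` is a hypothetical name, the 300-second default timeout is an assumption, and it presumes the `argocd` and `kubectl` CLIs are installed and authenticated.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sync an ArgoCD app, then wait for the rollout to finish.
# Arguments: <app-name> <deployment> <namespace> [timeout, default 300s]
deploy_and_verify() {
  local app="$1" deployment="$2" namespace="$3" timeout="${4:-300s}"

  # Trigger a manual sync and stop if ArgoCD reports an error.
  argocd app sync "$app" || return 1

  # Block until the rollout completes or the timeout expires.
  if ! kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"; then
    echo "rollout of $deployment did not complete within $timeout" >&2
    return 1
  fi

  # Show the resulting pods for a final visual check.
  kubectl get pods -n "$namespace"
}
```

Example call: `deploy_and_verify <app-name> <name> <namespace> 120s`. A non-zero exit status means the sync or the rollout failed.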

Scale Deployment

```
# Scale replicas
kubectl scale deployment/<name> -n <namespace> --replicas=3

# Watch the pods until the new replica count is reached
kubectl get pods -n <namespace> -w
```

Restart Application

```
# Rolling restart
kubectl rollout restart deployment/<name> -n <namespace>

# Monitor restart
kubectl rollout status deployment/<name> -n <namespace>
```

Incident Response

Pod CrashLooping

Symptoms: Pod restarts repeatedly and shows CrashLoopBackOff status

Steps:

1. Check pod status and recent events
```
kubectl describe pod <pod-name> -n <namespace>
```
2. View logs from the previous (crashed) container
```
kubectl logs <pod-name> -n <namespace> --previous
```
3. Check resource usage against the configured limits
```
kubectl top pod <pod-name> -n <namespace>
```
Common causes:
- OOMKilled: Increase the memory limit
- Config error: Check ConfigMaps and Secrets
- Dependency unavailable: Check network and services
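The common causes above can often be told apart by the last termination reason that Kubernetes records for the container. The sketch below reads that field and prints a first guess. `crashloop_hint` is a hypothetical name, and it only inspects the first container in the pod, which is an assumption that will not suit multi-container pods.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the last termination reason to a starting point.
# Only the first container is inspected (assumption: single-container pod).
crashloop_hint() {
  local pod="$1" namespace="$2" reason
  reason="$(kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')"

  case "$reason" in
    OOMKilled) echo "OOMKilled: raise the memory limit or reduce usage" ;;
    Error)     echo "Error: application exited non-zero; check logs and config" ;;
    "")        echo "No previous termination recorded; check pod events" ;;
    *)         echo "$reason: see kubectl describe pod for details" ;;
  esac
}
```

Example call: `crashloop_hint <pod-name> <namespace>`. The hint is a starting point only; the logs from `--previous` remain the primary evidence.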

Node Not Ready

Symptoms: Node shows NotReady status in `kubectl get nodes`

Steps:

1. Check node status and conditions
```
kubectl describe node <node-name>
```
2. Check the k3s agent service, which runs the kubelet (on a server node the unit is k3s rather than k3s-agent)
```
ssh <node> systemctl status k3s-agent
```
3. Check disk pressure
```
ssh <node> df -h
```
4. Check memory
```
ssh <node> free -m
```
Remediation:
- Restart the agent: `systemctl restart k3s-agent`
- Clear disk space
- Drain and reboot the node
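The last remediation, draining and rebooting, is the one most likely to be mistyped under pressure, so a sketch of the sequence follows. The function names are hypothetical and the timeouts are assumptions. The reboot itself is left as a manual step between the two calls, because a node can still report Ready for a short while after the reboot command is issued.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the drain / reboot / uncordon sequence.

# Evict workloads from the node. DaemonSet pods are skipped and
# emptyDir data is discarded, which is usually acceptable before a reboot.
drain_node() {
  local node="$1"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s
}

# Run this only after the node has rebooted and rejoined the cluster.
# Waits for the Ready condition, then allows scheduling again.
uncordon_when_ready() {
  local node="$1"
  kubectl wait --for=condition=Ready "node/$node" --timeout=600s || return 1
  kubectl uncordon "$node"
}
```

Typical order: `drain_node <node-name>`, then reboot the node over SSH, then `uncordon_when_ready <node-name>` once it is back up.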

Database Connection Failure

Symptoms: Applications cannot connect to the database

Steps:

1. Check the database pod
```
kubectl get pods -n databases
```
2. Test connectivity from an application pod (requires nc in the container image)
```
kubectl exec <app-pod> -n <namespace> -- nc -zv mongodb.databases.svc 27017
```
3. Check the service and its endpoints
```
kubectl get svc -n databases
kubectl get endpoints -n databases
```
4. Check database logs
```
kubectl logs -n databases deployment/mongodb
```
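An empty endpoints list is a frequent cause of this failure: the Service exists, but no ready pod matches its selector. The sketch below turns the endpoints check into a pass/fail test. `service_has_endpoints` is a hypothetical name; the `mongodb` Service and `databases` namespace in the example call are taken from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a Service has at least one ready endpoint.
service_has_endpoints() {
  local svc="$1" namespace="$2" ips
  # Collect the IPs of all ready addresses behind the Service.
  ips="$(kubectl get endpoints "$svc" -n "$namespace" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"

  if [ -z "$ips" ]; then
    echo "no ready endpoints for $svc: check pod readiness and the Service selector"
    return 1
  fi
  echo "ready endpoints for $svc: $ips"
}
```

Example call: `service_has_endpoints mongodb databases`. A failure here points at the database pod or the selector rather than at the network path.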

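High CPU/Memory

Symptoms: Slow performance, alerts firing

Steps:

1. Identify top consumers (use --sort-by=memory to rank by memory instead)
```
kubectl top pods --all-namespaces --sort-by=cpu | head
```
2. Check node resources
```
kubectl top nodes
```
3. Scale or adjust limits (if ArgoCD manages the deployment, make the change in Git, since a sync can revert manual edits)
```
kubectl edit deployment <name> -n <namespace>
```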
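To go from "top consumers" to "consumers above a limit", the output of `kubectl top pods` can be filtered by a CPU threshold. This is a sketch under assumptions: `pods_over_cpu` is a hypothetical name, the 500m default is arbitrary, and the input is expected in the `--all-namespaces --no-headers` column layout (namespace, name, CPU, memory).

```shell
#!/usr/bin/env bash
# Hypothetical helper: list pods whose CPU usage exceeds a threshold.
# Reads `kubectl top pods --all-namespaces --no-headers` output on stdin.
# Argument: threshold in millicores (default 500).
pods_over_cpu() {
  local threshold="${1:-500}"
  awk -v limit="$threshold" '
    {
      cpu = $3
      # Values are normally millicores ("250m"); treat a bare number as whole cores.
      if (cpu ~ /m$/) { sub(/m$/, "", cpu); millicores = cpu + 0 }
      else            { millicores = cpu * 1000 }
      if (millicores > limit) print $1 "/" $2, millicores "m"
    }'
}
```

Example call: `kubectl top pods --all-namespaces --no-headers | pods_over_cpu 750`.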

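Storage Full

Symptoms: Write errors, pod failures

Steps:

1. Check PVC usage
```
kubectl exec <pod> -n <namespace> -- df -h
```
2. Expand the volume (only if the StorageClass sets allowVolumeExpansion: true)
```
kubectl edit pvc <name> -n <namespace>
# Increase spec.resources.requests.storage
```
3. Clean up old data (list the matching files first, then delete them)
```
kubectl exec <pod> -n <namespace> -- find /data -type f -mtime +30 -print
kubectl exec <pod> -n <namespace> -- find /data -type f -mtime +30 -delete
```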
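The usage check in the first step can be reduced to the mounts that actually need attention. The sketch below filters POSIX-format `df` output by a usage threshold. `filesystems_over` is a hypothetical name, the 80% default is an assumption, and it expects `df -P` output so that each filesystem stays on one line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list mount points at or above a usage threshold.
# Reads `df -P` output on stdin. Argument: threshold in percent (default 80).
filesystems_over() {
  local threshold="${1:-80}"
  awk -v limit="$threshold" '
    NR > 1 {                    # skip the header line
      use = $5                  # capacity column, e.g. "91%"
      sub(/%$/, "", use)
      if (use + 0 >= limit) print $6, use "%"
    }'
}
```

Example call: `kubectl exec <pod> -n <namespace> -- df -P | filesystems_over 85`.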

Maintenance Tasks

Certificate Renewal

Cloudflare manages external certificates automatically.

Update Kubernetes

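```
# Back up the datastore (on the master)
ssh master
# k3s runs etcd inside its own process rather than as a pod, so use the
# built-in snapshot command. It applies only when embedded etcd is the datastore;
# with the default SQLite datastore, back up /var/lib/rancher/k3s/server/db instead.
k3s etcd-snapshot save

# Update master (same SSH session)
curl -sfL https://get.k3s.io | sh -

# Update workers (one at a time)
ssh worker-1
curl -sfL https://get.k3s.io | K3S_URL=https://master:6443 K3S_TOKEN=<token> sh -
```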
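After updating nodes one at a time, it helps to confirm that none was missed. The sketch below lists nodes whose kubelet version differs from the expected one. `nodes_not_on_version` is a hypothetical name, and the expected version string must be supplied by the operator; no specific release is implied here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list nodes that are not on the expected version.
# Reads "NAME VERSION" lines on stdin. Argument: expected version string.
nodes_not_on_version() {
  local expected="$1"
  awk -v want="$expected" '$2 != want { print $1, $2 }'
}
```

Example call: `kubectl get nodes --no-headers -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion | nodes_not_on_version <expected-version>`. Empty output means every node reports the expected version.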

Rotate Secrets

```
# Create or update the secret with the new value
kubectl create secret generic <name> \
  --from-literal=key=new-value \
  -n <namespace> \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart deployments that use the secret
kubectl rollout restart deployment/<name> -n <namespace>
```
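The `new-value` placeholder above has to come from somewhere. One option is to generate it from the kernel's random source, as sketched below. Both function names are hypothetical, the 32-byte default is an assumption, and the approach only suits secrets that the application does not need to share with an external system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a random base64 value on a single line.
# Argument: number of random bytes (default 32).
generate_secret_value() {
  local bytes="${1:-32}"
  head -c "$bytes" /dev/urandom | base64 | tr -d '\n'
}

# Hypothetical helper: write a fresh random value into one key of a secret.
# Arguments: <secret-name> <namespace> <key>
rotate_secret() {
  local name="$1" namespace="$2" key="$3" value
  value="$(generate_secret_value)"
  kubectl create secret generic "$name" \
    --from-literal="$key=$value" \
    -n "$namespace" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Example call: `rotate_secret <name> <namespace> key`, followed by the rolling restart above. The value is passed on the command line, so it is briefly visible in the local process list; run this only on a trusted workstation.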

Monitoring Checks

Daily Checks

| Check | Command |
|-------|---------|
| Node status | `kubectl get nodes` |
| Pod health | `kubectl get pods -A \| grep -v Running` |
| PVC status | `kubectl get pvc -A` |
| Alerts | Check Alertmanager |
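The pod health check that filters on `Running` has two blind spots: it lists finished jobs (status Completed) as if they were problems, and it hides pods that are Running but not ready. The sketch below is one way to cover both. `unhealthy_pods` is a hypothetical name, and it assumes the default column layout of `kubectl get pods -A --no-headers`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list pods that need attention.
# Reads `kubectl get pods -A --no-headers` output on stdin.
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk '
    {
      if ($4 == "Completed") next          # finished jobs are fine
      split($3, ready, "/")                # READY is "<ready>/<total>"
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2, $3, $4
    }'
}
```

Example call: `kubectl get pods -A --no-headers | unhealthy_pods`. Empty output means every pod is either Running with all containers ready or Completed.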

Weekly Checks

| Check | Action |
|-------|--------|
| Backup verification | Test restore |
| Log review | Check for errors |
| Resource trends | Review Grafana |
| Security updates | Check CVEs |

Emergency Contacts

| Role | Contact |
|------|---------|
| Primary On-Call | - |
| Escalation | - |
| Infrastructure | - |

Post-Incident

Template

## Incident Report

**Date**: YYYY-MM-DD
**Duration**: X hours
**Severity**: High/Medium/Low

### Summary
Brief description of what happened.

### Timeline
- HH:MM - Event detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Resolution applied
- HH:MM - Incident resolved

### Root Cause
What caused the incident.

### Resolution
How the incident was resolved.

### Action Items
- [ ] Preventive measure 1
- [ ] Preventive measure 2