Backup & Recovery

Overview

The backup strategy covers persistent data, cluster configuration, and disaster recovery procedures.

Backup Scope

graph TB
    subgraph Data
        DB[(Databases)]
        PVC[Persistent Volumes]
    end

    subgraph Configuration
        K8s[K8s Manifests]
        Secrets[Secrets]
        ConfigMaps[ConfigMaps]
    end

    subgraph External
        Git[Git Repository]
        S3[Backup Storage]
    end

    DB --> S3
    PVC --> S3
    K8s --> Git
    Secrets --> S3

Database Backups

MongoDB

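#!/bin/bash
# MongoDB backup script
set -euo pipefail  # abort on the first failed command so the job exits non-zero

BACKUP_DIR="/backup/mongodb/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Dump all databases as gzip-compressed BSON
mongodump \
  --uri="mongodb://user:pass@mongodb:27017" \
  --out="$BACKUP_DIR" \
  --gzip

# Retention: 7 days (-mindepth 1 protects /backup/mongodb itself)
find /backup/mongodb -mindepth 1 -mtime +7 -delete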
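
The retention step above relies on file modification times, which can change when backups are copied or touched. Since the backup directories are already named by date, one alternative is to prune by name. The following is a minimal sketch, assuming GNU `date` and directory names in `YYYYMMDD` form; `prune_dated_dirs` is a hypothetical helper, not part of the original script.

```shell
#!/bin/bash
# Sketch: prune dated backup directories by name rather than by mtime.
# prune_dated_dirs is a hypothetical helper name.
# Assumes GNU date (for -d) and directory names of the form YYYYMMDD.
prune_dated_dirs() {
  local root="$1" keep_days="$2" cutoff dir name
  cutoff="$(date -d "-${keep_days} days" +%Y%m%d)"
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue                  # glob matched nothing
    name="$(basename "$dir")"
    # Only consider directories whose name is exactly eight digits.
    [[ "$name" =~ ^[0-9]{8}$ ]] || continue
    # Equal-length digit strings compare correctly as plain strings.
    if [[ "$name" < "$cutoff" ]]; then
      rm -rf -- "${dir%/}"
    fi
  done
}
```

Calling `prune_dated_dirs /backup/mongodb 7` would then remove dated directories older than seven days and leave anything that does not look like a dated backup alone.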

MySQL

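#!/bin/bash
# MySQL backup script
set -euo pipefail  # pipefail: a mysqldump failure is not masked by gzip succeeding

BACKUP_FILE="/backup/mysql/backup_$(date +%Y%m%d).sql.gz"
mkdir -p "$(dirname "$BACKUP_FILE")"

# --single-transaction gives a consistent InnoDB snapshot without locking tables
mysqldump \
  -u root -p"$MYSQL_ROOT_PASSWORD" \
  --all-databases \
  --single-transaction \
  --routines \
  --triggers \
  | gzip > "$BACKUP_FILE"

# Retention: 7 days
find /backup/mysql -type f -name 'backup_*.sql.gz' -mtime +7 -delete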
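
A dump file can exist and still be truncated, for example when the disk fills up mid-backup. A cheap sanity check is sketched below: it tests the gzip stream and looks for the `-- Dump completed` comment that mysqldump writes at the end of a finished dump. `verify_mysql_dump` is a hypothetical helper name, and the trailer check assumes comments were not disabled (for example with `--skip-comments`).

```shell
#!/bin/bash
# Sketch: basic integrity check for a gzip-compressed mysqldump file.
# verify_mysql_dump is a hypothetical helper name.
verify_mysql_dump() {
  local file="$1"
  # gzip -t fails on truncated or corrupt archives.
  gzip -t "$file" 2>/dev/null || return 1
  # A finished dump ends with a "-- Dump completed" comment line.
  gunzip -c "$file" | tail -n 5 | grep -q '^-- Dump completed'
}
```

For example, `verify_mysql_dump "$BACKUP_FILE" || exit 1` at the end of the backup script would turn a truncated dump into a non-zero exit code.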

InfluxDB

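#!/bin/bash
# InfluxDB backup script
set -euo pipefail

BACKUP_DIR="/backup/influxdb/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Back up the "metrics" bucket; the influx CLI also needs a host and token,
# e.g. from an active CLI config or INFLUX_HOST / INFLUX_TOKEN
influx backup "$BACKUP_DIR" \
  --bucket metrics \
  --org default

# Retention: 4 weeks (-mindepth 1 protects /backup/influxdb itself)
find /backup/influxdb -mindepth 1 -mtime +28 -delete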

Elasticsearch

# Create snapshot
curl -X PUT "elasticsearch:9200/_snapshot/backup/snap_$(date +%Y%m%d)" \
  -H 'Content-Type: application/json' \
  -d '{
    "indices": "logs-*,metrics-*",
    "include_global_state": false
  }'
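
The snapshot request returns before the snapshot has finished, so a follow-up check of the snapshot's `state` field is useful. The sketch below extracts that field from the JSON returned by `GET _snapshot/backup/<name>`, using only `grep` and `sed`. `snapshot_state` is a hypothetical helper name; a JSON-aware tool such as `jq` would be more robust where it is available.

```shell
#!/bin/bash
# Sketch: read a snapshot-info JSON document on stdin and print the first
# "state" value (e.g. SUCCESS, IN_PROGRESS, PARTIAL, FAILED).
# snapshot_state is a hypothetical helper name.
snapshot_state() {
  grep -o '"state"[[:space:]]*:[[:space:]]*"[A-Z_]*"' \
    | head -n 1 \
    | sed 's/.*"\([A-Z_]*\)"$/\1/'
}

# Usage (not executed here):
#   curl -s "elasticsearch:9200/_snapshot/backup/snap_$(date +%Y%m%d)" | snapshot_state
```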

Volume Backups

Longhorn Snapshots

apiVersion: longhorn.io/v1beta1
kind: RecurringJob
metadata:
  name: daily-backup
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"
  task: backup
  groups:
    - default
  retain: 7
  concurrency: 2

Manual Snapshot

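# Create snapshot
# The Snapshot CRD is served under longhorn.io/v1beta2 (Longhorn 1.3+);
# createSnapshot: true asks Longhorn to take a new snapshot of the volume.
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Snapshot
metadata:
  name: manual-snapshot
  namespace: longhorn-system
spec:
  volume: my-volume
  createSnapshot: true
EOF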

Configuration Backup

GitOps

All Kubernetes manifests are stored in Git:

Repository   Content
hub          Hub application manifests
wiki         Wiki manifests
infra        Infrastructure configs

Secrets Backup

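# Export secrets (encrypted)
# Sealed output can only be decrypted with the controller's sealing key,
# so that key must be backed up separately for cluster-level recovery.
kubectl get secrets -A -o yaml | \
  kubeseal --format yaml > secrets-backup.yaml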

Backup Schedule

Data                 Frequency         Retention    Location
MongoDB              Daily 2AM         7 days       NAS
MySQL                Daily 2AM         7 days       NAS
InfluxDB             Weekly            4 weeks      NAS
Elasticsearch        Daily             7 days       NAS
Longhorn volumes     Daily             7 days       NAS
Kubernetes configs   On commit (Git)   Indefinite   GitHub

Recovery Procedures

MongoDB Restore

# Full restore
mongorestore \
  --uri="mongodb://user:pass@mongodb:27017" \
  --gzip \
  /backup/mongodb/20240115/

# Single database (--nsInclude replaces the deprecated --db for directory restores)
mongorestore \
  --uri="mongodb://user:pass@mongodb:27017" \
  --gzip \
  --nsInclude="hub.*" \
  /backup/mongodb/20240115/

MySQL Restore

# Full restore
gunzip < /backup/mysql/backup_20240115.sql.gz | \
  mysql -u root -p

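# Single database from the full dump (--one-database skips statements for other databases)
gunzip < /backup/mysql/backup_20240115.sql.gz | \
  mysql -u root -p --one-database app_data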

Longhorn Restore

  1. Create volume from backup in Longhorn UI
  2. Create PVC referencing restored volume
  3. Update deployment to use new PVC

Disaster Recovery

Scenario              RTO        RPO        Procedure
Pod failure           Minutes    0          Automatic reschedule
Node failure          Minutes    0          Pod migration
Database corruption   1 hour     24 hours   Restore from backup
Cluster failure       4 hours    24 hours   Full rebuild

Verification

Backup Testing

# Test MongoDB backup
mongorestore --dryRun --gzip /backup/mongodb/latest/

# Test MySQL backup
mysql -u root -p test_db < backup.sql
mysql -u root -p -e "SELECT COUNT(*) FROM test_db.users"

Monthly DR Test

  1. Spin up test environment
  2. Restore all backups
  3. Verify application functionality
  4. Document any issues
  5. Update procedures

Monitoring

Backup Alerts

Alert           Condition
Backup failed   Exit code != 0
Backup old      Last backup > 24h
Storage low     Backup disk > 80%
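
The "Backup old" and "Storage low" conditions can be evaluated with standard tools. The following is a minimal sketch, assuming a `find` that supports `-mmin` and `-quit` (GNU and BSD do) and POSIX `df -P` output. The function names are hypothetical, and the thresholds are passed in rather than hard-coded.

```shell
#!/bin/bash
# Sketch: shell checks for two of the alert conditions.
# Function names are hypothetical helpers.

# Succeeds (alert) when no file under $1 was modified in the last $2 hours.
backup_is_stale() {
  local dir="$1" max_hours="$2"
  # find prints the first file modified inside the window; no output means stale.
  [ -z "$(find "$dir" -type f -mmin "-$((max_hours * 60))" -print -quit)" ]
}

# Prints the used-space percentage of the filesystem holding $1, without "%".
disk_usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds (alert) when usage of the filesystem holding $1 exceeds $2 percent.
storage_is_low() {
  [ "$(disk_usage_percent "$1")" -gt "$2" ]
}
```

With the thresholds from the table, the checks would be `backup_is_stale /backup/mongodb 24` and `storage_is_low /backup 80`. An empty backup directory counts as stale.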

Metrics

  • Last backup timestamp
  • Backup size
  • Backup duration
  • Storage utilization
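
One way to expose these metrics is to have each backup script write them to a file in the Prometheus text exposition format, for example for a node_exporter textfile collector. This is a sketch under that assumption. The metric names, the `backup` label, and `write_backup_metrics` are all hypothetical and should be adapted to whatever monitoring stack is in use.

```shell
#!/bin/bash
# Sketch: write backup metrics in Prometheus text format.
# write_backup_metrics and all metric names are hypothetical.
# Arguments: backup name, backup directory, duration in seconds, output file.
write_backup_metrics() {
  local name="$1" dir="$2" duration="$3" out="$4" size
  # du -sk reports kilobytes; convert to bytes with shell arithmetic.
  size=$(( $(du -sk "$dir" | cut -f1) * 1024 ))
  # Write to a temp file first so a scraper never reads a half-written file.
  cat > "$out.tmp" <<EOF
backup_last_success_timestamp_seconds{backup="$name"} $(date +%s)
backup_size_bytes{backup="$name"} $size
backup_duration_seconds{backup="$name"} $duration
EOF
  mv "$out.tmp" "$out"
}
```

In a backup script, bash's built-in `SECONDS` variable gives the duration: set `SECONDS=0` before the dump and pass `"$SECONDS"` afterwards.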

Best Practices

  1. 3-2-1 Rule - 3 copies, 2 media types, 1 offsite
  2. Test restores - Regularly verify backups work
  3. Encrypt backups - Protect sensitive data
  4. Document procedures - Clear runbooks
  5. Automate - Reduce human error
  6. Monitor - Alert on failures
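
As an illustration of the "Encrypt backups" practice, the sketch below encrypts a backup file symmetrically with OpenSSL before it leaves the host. It assumes OpenSSL 1.1.1 or later (for `-pbkdf2`) and a passphrase stored in a key file; the function names and paths are hypothetical. Other tools such as GnuPG or age would serve the same purpose.

```shell
#!/bin/bash
# Sketch: symmetric encryption of a backup file with OpenSSL.
# encrypt_backup / decrypt_backup are hypothetical helper names.
# Assumes OpenSSL 1.1.1+ and a passphrase on the first line of the key file.

# Writes "$1.enc" next to the source file.
encrypt_backup() {
  local src="$1" keyfile="$2"
  openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in "$src" -out "$src.enc" -pass "file:$keyfile"
}

# Decrypts "$1" into "$3".
decrypt_backup() {
  local src="$1" keyfile="$2" dest="$3"
  openssl enc -d -aes-256-cbc -pbkdf2 \
    -in "$src" -out "$dest" -pass "file:$keyfile"
}
```

The key file has to be stored apart from the backups themselves, since an encrypted backup is unrecoverable without it.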