Alerting

Overview

Alerting is handled by Prometheus Alertmanager for metric-based alerts, with notifications sent to Discord via webhooks. All critical infrastructure and application alerts are routed to Discord with user mentions for immediate attention.

Architecture

graph LR
    subgraph Sources
        P[Prometheus]
        K[Kibana]
    end

    subgraph Routing
        AM[Alertmanager]
    end

    subgraph Notifications
        D[Discord Webhook]
    end

    P -->|Alert Rules| AM
    K -->|Rule action| D
    AM -->|@mention| D

Discord Integration

Alertmanager Discord Configuration

Alertmanager is configured to send alerts to Discord via webhook. The webhook URL is stored in a Kubernetes secret:

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-discord
  namespace: monitoring
type: Opaque
stringData:
  webhook-url: "<DISCORD_WEBHOOK_URL>"
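To keep the real URL out of version control, the same Secret can be created straight from an environment variable. The sketch below assumes that variable is called `DISCORD_WEBHOOK_URL`; the `is_discord_webhook_url` helper is hypothetical and only checks the general shape of the URL (`https://discord.com/api/webhooks/<id>/<token>`), not that the webhook actually exists.

```shell
# Hypothetical helper: succeeds only for URLs shaped like a Discord webhook.
is_discord_webhook_url() {
  case "$1" in
    https://discord.com/api/webhooks/*/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (assumes DISCORD_WEBHOOK_URL is exported in your shell):
# is_discord_webhook_url "$DISCORD_WEBHOOK_URL" && \
#   kubectl -n monitoring create secret generic alertmanager-discord \
#     --from-literal=webhook-url="$DISCORD_WEBHOOK_URL"
```

Checking the shape first catches the common mistake of pasting a channel link instead of a webhook URL.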

Alertmanager Config

global:
  resolve_timeout: 5m

route:
  receiver: 'discord-alerts'
  group_by: ['alertname', 'namespace', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
  # Critical alerts - immediate notification
  - match:
      severity: critical
    receiver: 'discord-critical'
    repeat_interval: 1h
    continue: false

  # Warning alerts
  - match:
      severity: warning
    receiver: 'discord-alerts'
    repeat_interval: 4h

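  # Nextcloud specific alerts. Routing stops at the first matching route, so
  # only alerts that matched neither severity route above (e.g. severity: info)
  # reach this one.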
  - match:
      namespace: nextcloud
    receiver: 'discord-alerts'
    group_by: ['alertname']

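receivers:
# Native Discord receivers (webhook_url_file needs Alertmanager 0.26 or later).
# A generic webhook_configs entry posts Alertmanager's own JSON payload,
# which Discord webhooks reject.
- name: 'discord-alerts'
  discord_configs:
  - webhook_url_file: '/etc/alertmanager/secrets/webhook-url'
    send_resolved: true
    # Render the body with the custom template defined in discord.tmpl
    message: '{{ template "discord.message" . }}'
    http_config:
      follow_redirects: true

- name: 'discord-critical'
  discord_configs:
  - webhook_url_file: '/etc/alertmanager/secrets/webhook-url'
    send_resolved: true
    message: '{{ template "discord.message" . }}'
    http_config:
      follow_redirects: true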
templates:
- '/etc/alertmanager/templates/*.tmpl'
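Because routing stops at the first matching child route, the order of the routes decides which receiver and grouping an alert gets. The function below is a hand-written mirror of the three routes in this config, not Alertmanager's own logic; `route_for` is a hypothetical name, and it only looks at the `severity` and `namespace` labels.

```shell
# Hand-written mirror of the routing tree above; not Alertmanager's own code.
# Prints "<receiver> <group_by>" for an alert's severity and namespace labels.
route_for() {
  severity=$1
  namespace=$2
  if [ "$severity" = "critical" ]; then
    echo "discord-critical alertname,namespace,severity"
  elif [ "$severity" = "warning" ]; then
    echo "discord-alerts alertname,namespace,severity"
  elif [ "$namespace" = "nextcloud" ]; then
    echo "discord-alerts alertname"
  else
    # No child route matched: fall back to the top-level route.
    echo "discord-alerts alertname,namespace,severity"
  fi
}

route_for critical nextcloud   # severity route wins; the Nextcloud route is never reached
route_for info nextcloud       # only non-critical, non-warning Nextcloud alerts land here
```

For an authoritative answer against the real file, `amtool config routes test --config.file=alertmanager.yml severity=critical namespace=nextcloud` prints the receiver Alertmanager would pick (the file name here is an assumption).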

Discord Message Template

Custom template for Discord webhook formatting:

# /etc/alertmanager/templates/discord.tmpl
{{ define "discord.message" }}
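{{/* CommonAnnotations only holds annotations shared by every alert in the group, so fall back to the alert name when summaries differ. */}}
{{ if eq .Status "firing" }}
**{{ or .CommonAnnotations.summary .CommonLabels.alertname }}**

{{ range .Alerts }}
- **Alert:** {{ .Labels.alertname }}
- **Severity:** {{ .Labels.severity }}
- **Namespace:** {{ .Labels.namespace }}
- **Description:** {{ .Annotations.description }}
- **Started:** {{ .StartsAt.Format "2006-01-02 15:04:05 MST" }}

{{ end }}
{{ else }}
**RESOLVED: {{ or .CommonAnnotations.summary .CommonLabels.alertname }}**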
{{ end }}

<@YOUR_DISCORD_USER_ID>
{{ end }}

User Mention

Replace YOUR_DISCORD_USER_ID with your numeric Discord user ID, keeping the surrounding <@ and >, to receive @mentions. To find your ID, enable Developer Mode in Discord (User Settings → Advanced), then right-click your name and choose "Copy User ID".
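One way to confirm the ID is correct is to post a message straight to the webhook, bypassing Alertmanager. This is a minimal sketch: `mention_payload` is a hypothetical helper, the ID shown is a placeholder, and the message text must not contain double quotes or backslashes because no JSON escaping is done.

```shell
# Build the JSON body Discord expects for a webhook message with a user mention.
mention_payload() {
  # $1 = message text (no quotes or backslashes), $2 = numeric Discord user ID
  printf '{"content": "%s <@%s>"}\n' "$1" "$2"
}

mention_payload "Alerting test" "123456789012345678"

# To send it (assumes DISCORD_WEBHOOK_URL is set in your shell):
# mention_payload "Alerting test" "123456789012345678" | \
#   curl -sS -H "Content-Type: application/json" -d @- "$DISCORD_WEBHOOK_URL"
```

If the message arrives but the mention shows as plain text, the ID is most likely wrong or belongs to a user who is not in the server.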

Alert Rules

Infrastructure Alerts

groups:
- name: infrastructure
  rules:
  # Node down - Critical
  - alert: NodeDown
    expr: up{job="node-exporter"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} is DOWN"
      description: "Node {{ $labels.instance }} has been unreachable for more than 2 minutes."

  # High CPU usage
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is {{ printf \"%.2f\" $value }}% on {{ $labels.instance }}."

  # High memory usage
  - alert: HighMemoryUsage
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is {{ printf \"%.2f\" $value }}% on {{ $labels.instance }}."

  # Disk space low
  - alert: DiskSpaceLow
    expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Disk space low on {{ $labels.instance }}"
      description: "Disk usage is {{ printf \"%.2f\" $value }}% on {{ $labels.mountpoint }}."

  # Disk space critical
  - alert: DiskSpaceCritical
    expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 95
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "CRITICAL: Disk almost full on {{ $labels.instance }}"
      description: "Disk usage is {{ printf \"%.2f\" $value }}% on {{ $labels.mountpoint }}. Immediate action required!"

Kubernetes Alerts

- name: kubernetes
  rules:
  # Pod crash looping
  - alert: PodCrashLooping
    expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
      description: "Pod has restarted about {{ printf \"%.0f\" $value }} times in the last hour."

  # Pod not ready
  - alert: PodNotReady
    expr: kube_pod_status_phase{phase=~"Pending|Unknown"} == 1
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
      description: "Pod has been in {{ $labels.phase }} state for more than 15 minutes."

  # Deployment replicas mismatch
  - alert: DeploymentReplicasMismatch
    expr: kube_deployment_status_replicas_available != kube_deployment_spec_replicas
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has mismatched replicas"
      description: "{{ $value }} replicas are available, which does not match the desired replica count."

  # StatefulSet replicas mismatch
  - alert: StatefulSetReplicasMismatch
    expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} not ready"

  # PVC pending
  - alert: PersistentVolumeClaimPending
    expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending"

Nextcloud Alerts

- name: nextcloud
  rules:
  # Nextcloud down
  - alert: NextcloudDown
    expr: up{job="nextcloud"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Nextcloud is DOWN"
      description: "Nextcloud metrics endpoint is not responding."

  # Nextcloud storage low
  - alert: NextcloudStorageLow
    expr: nextcloud_free_space_bytes < 107374182400
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Nextcloud storage below 100 GiB"
      description: "Free space: {{ $value | humanize1024 }}B"

  # Nextcloud storage critical
  - alert: NextcloudStorageCritical
    expr: nextcloud_free_space_bytes < 21474836480
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "CRITICAL: Nextcloud storage below 20 GiB"
      description: "Free space: {{ $value | humanize1024 }}B. Immediate action required!"

  # No active users (possible issue)
  - alert: NextcloudNoActiveUsers
    expr: nextcloud_active_users_total == 0
    for: 1h
    labels:
      severity: info
    annotations:
      summary: "No active Nextcloud users in the last hour"
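The storage thresholds above are byte counts for binary units (GiB), which is also what `humanize1024` reports. A small sketch for deriving them; `gib_to_bytes` is a hypothetical helper name.

```shell
# Convert GiB to bytes using integer arithmetic (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 100   # 107374182400 (NextcloudStorageLow threshold)
gib_to_bytes 20    # 21474836480  (NextcloudStorageCritical threshold)
```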

Application Alerts

- name: applications
  rules:
  # High error rate
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate in {{ $labels.job }}"
      description: "Error rate is {{ $value | humanizePercentage }}"

  # Slow responses
  - alert: SlowResponses
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Slow response times detected"
      description: "95th percentile latency is {{ printf \"%.2f\" $value }}s"

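  # Service down (node-exporter and Nextcloud are excluded because they have
  # dedicated alerts above and would otherwise fire twice)
  - alert: ServiceDown
    expr: up{job!~"node-exporter|nextcloud"} == 0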
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Service {{ $labels.job }} is DOWN"
      description: "{{ $labels.instance }} is not responding to scrapes."

Longhorn Storage Alerts

- name: longhorn
  rules:
  # Volume degraded
  - alert: LonghornVolumeDegraded
    expr: longhorn_volume_robustness == 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Longhorn volume {{ $labels.volume }} is degraded"

  # Volume faulted
  - alert: LonghornVolumeFaulted
    expr: longhorn_volume_robustness == 3
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "CRITICAL: Longhorn volume {{ $labels.volume }} is faulted"

  # Node storage low
  - alert: LonghornNodeStorageLow
    expr: (longhorn_node_storage_capacity_bytes - longhorn_node_storage_usage_bytes) / longhorn_node_storage_capacity_bytes < 0.15
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Longhorn node {{ $labels.node }} storage below 15%"

Alert Severities

Severity	Response Time	Examples	Discord Behavior
Critical	Immediate	Node down, service outage, disk full	@mention, repeat every 1h
Warning	Within hours	High resource usage, degraded	@mention, repeat every 4h
Info	Next business day	Maintenance, non-urgent	No mention, repeat every 12h

Managing Alerts

Silencing Alerts

Create silences during maintenance windows:

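# Via kubectl. amtool needs the Alertmanager address unless the pod ships an
# amtool config file, so it is passed explicitly here.
kubectl -n monitoring exec prometheus-alertmanager-0 -- \
  amtool --alertmanager.url=http://localhost:9093 silence add \
  alertname="NodeDown" instance="k8s-worker-5" \
  --duration=2h --comment="Scheduled maintenance"

# List active silences
kubectl -n monitoring exec prometheus-alertmanager-0 -- \
  amtool --alertmanager.url=http://localhost:9093 silence query

# Expire a silence
kubectl -n monitoring exec prometheus-alertmanager-0 -- \
  amtool --alertmanager.url=http://localhost:9093 silence expire <silence-id>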

Testing Alerts

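# Send a test alert to Alertmanager (assumes the port-forward from
# "View Active Alerts" is running). The v1 API was removed in Alertmanager 0.27.
curl -X POST http://localhost:9093/api/v2/alerts \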
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing Discord webhook integration"
    }
  }]'
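The same request can be parameterised to exercise each severity route in turn, for example to confirm that `severity: critical` reaches the `discord-critical` receiver. This is a sketch under a few assumptions: `test_alert_json` is a hypothetical helper, it targets the v2 API path, and its arguments must not contain double quotes because no JSON escaping is done.

```shell
# Print a one-element alert array suitable for POST /api/v2/alerts.
# $1 = alertname, $2 = severity; values must not contain double quotes.
test_alert_json() {
  printf '[{"labels": {"alertname": "%s", "severity": "%s"}, ' "$1" "$2"
  printf '"annotations": {"summary": "Test alert (%s)", ' "$2"
  printf '"description": "Testing Discord webhook integration"}}]\n'
}

test_alert_json TestAlert critical

# To send it (assumes Alertmanager is reachable on localhost:9093):
# test_alert_json TestAlert critical | \
#   curl -sS -X POST http://localhost:9093/api/v2/alerts \
#     -H "Content-Type: application/json" -d @-
```

Expect the Discord message only after `group_wait` (30s in the config above) has elapsed.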

Kibana Alerts

Log-Based Alerts

Kibana can also send alerts based on log patterns:

1. Go to Stack Management → Rules
2. Create Elasticsearch query rule
3. Define query conditions (e.g., error count > threshold)
4. Set check interval
5. Configure Discord webhook action

Example: Error Spike Alert

{
  "rule_type": "es_query",
  "query": {
    "bool": {
      "must": [
        {"match": {"log.level": "error"}},
        {"range": {"@timestamp": {"gte": "now-5m"}}}
      ]
    }
  },
  "threshold": 50,
  "time_window": "5m",
  "actions": [{
    "type": "webhook",
    "url": "<DISCORD_WEBHOOK_URL>"
  }]
}

Best Practices

Avoid alert fatigue - Only alert on actionable items
Use severity levels - Prioritize response appropriately
Include runbook links - Provide resolution steps in annotations
Set appropriate thresholds - Tune to avoid false positives
Group related alerts - Reduce notification noise
Test alerts regularly - Verify they fire correctly
Document escalation - Know who to contact for what
Review and prune - Remove stale or noisy alerts

Troubleshooting

Check Alertmanager Status

kubectl -n monitoring get pods -l app.kubernetes.io/name=alertmanager
kubectl -n monitoring logs -l app.kubernetes.io/name=alertmanager

Verify Webhook Connectivity

kubectl -n monitoring exec -it prometheus-alertmanager-0 -- \
  wget -q --spider https://discord.com

View Active Alerts

Access the Alertmanager UI from inside the cluster at http://prometheus-alertmanager.monitoring.svc:9093, or from your workstation via port-forward:

kubectl -n monitoring port-forward svc/prometheus-alertmanager 9093:9093