Alerting¶
Overview¶
Alerting is handled by Prometheus Alertmanager for metric-based alerts, with notifications sent to Discord via webhooks. All critical infrastructure and application alerts are routed to Discord with user mentions for immediate attention.
Architecture¶
graph LR
subgraph Sources
P[Prometheus]
K[Kibana]
end
subgraph Routing
AM[Alertmanager]
end
subgraph Notifications
D[Discord Webhook]
end
P -->|Alert Rules| AM
K -->|Rule action| D
AM -->|@mention| D
Discord Integration¶
Alertmanager Discord Configuration¶
Alertmanager is configured to send alerts to Discord via webhook. The webhook URL is stored in a Kubernetes secret:
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-discord
  namespace: monitoring
type: Opaque
stringData:
  webhook-url: "<DISCORD_WEBHOOK_URL>"
Alertmanager Config¶
global:
  resolve_timeout: 5m
route:
  receiver: 'discord-alerts'
  group_by: ['alertname', 'namespace', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts - immediate notification
    - match:
        severity: critical
      receiver: 'discord-critical'
      repeat_interval: 1h
      continue: false
    # Warning alerts
    - match:
        severity: warning
      receiver: 'discord-alerts'
      repeat_interval: 4h
    # Nextcloud specific alerts
    - match:
        namespace: nextcloud
      receiver: 'discord-alerts'
      group_by: ['alertname']
receivers:
  - name: 'discord-alerts'
    webhook_configs:
      - url_file: '/etc/alertmanager/secrets/webhook-url'
        send_resolved: true
        http_config:
          follow_redirects: true
  - name: 'discord-critical'
    webhook_configs:
      - url_file: '/etc/alertmanager/secrets/webhook-url'
        send_resolved: true
        http_config:
          follow_redirects: true
templates:
  - '/etc/alertmanager/templates/*.tmpl'
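The routing tree above is evaluated top to bottom, and because `continue` is false (the default), the first matching child route wins. The sketch below is a simplified model of that behaviour, not Alertmanager's implementation; `pick_route` and the `ROUTES` table are hypothetical names that mirror the three child routes in the config.

```python
from typing import Dict, Optional, Tuple

# Child routes in the order they appear in the config: (match labels, receiver).
ROUTES = [
    ({"severity": "critical"}, "discord-critical"),
    ({"severity": "warning"}, "discord-alerts"),
    ({"namespace": "nextcloud"}, "discord-alerts"),
]
DEFAULT_RECEIVER = "discord-alerts"  # receiver of the top-level route


def pick_route(labels: Dict[str, str]) -> Tuple[Optional[int], str]:
    """Return (index of the first matching child route, receiver name).

    The index is None when no child route matches and the alert falls
    back to the top-level route.
    """
    for index, (match, receiver) in enumerate(ROUTES):
        if all(labels.get(key) == value for key, value in match.items()):
            return index, receiver
    return None, DEFAULT_RECEIVER


# A Nextcloud warning is claimed by the severity route (index 1), so the
# Nextcloud route (index 2) and its group_by override never see it.
print(pick_route({"severity": "warning", "namespace": "nextcloud"}))
# An info-level Nextcloud alert skips both severity routes and lands on index 2.
print(pick_route({"severity": "info", "namespace": "nextcloud"}))
```

Under this model, the Nextcloud route only receives alerts that carry neither `severity: critical` nor `severity: warning`. Whether that ordering is intended depends on how the Nextcloud alerts should be grouped.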
Discord Message Template¶
Custom template for Discord webhook formatting:
# /etc/alertmanager/templates/discord.tmpl
{{ define "discord.message" }}
{{ if eq .Status "firing" }}
**{{ .CommonAnnotations.summary }}**
{{ range .Alerts }}
- **Alert:** {{ .Labels.alertname }}
- **Severity:** {{ .Labels.severity }}
- **Namespace:** {{ .Labels.namespace }}
- **Description:** {{ .Annotations.description }}
- **Started:** {{ .StartsAt.Format "2006-01-02 15:04:05 MST" }}
{{ end }}
{{ else }}
**RESOLVED: {{ .CommonAnnotations.summary }}**
{{ end }}
<@YOUR_DISCORD_USER_ID>
{{ end }}
User Mention
Replace YOUR_DISCORD_USER_ID with your numeric Discord user ID, keeping the surrounding <@...> wrapper, to receive @mentions. To find your ID, enable Developer Mode in Discord settings, then right-click your name and select "Copy User ID".
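For reference, Discord's webhook endpoint expects a JSON body with a `content` field, and a user is pinged with the `<@USER_ID>` syntax. The sketch below shows one way to build such a body; `build_discord_payload` is a hypothetical helper and the user ID is a placeholder, neither taken from this setup.

```python
import json


def build_discord_payload(message: str, user_id: str) -> str:
    """Build the JSON body for a Discord webhook message that pings one user."""
    payload = {
        "content": f"{message}\n<@{user_id}>",
        # Restrict pings to the listed user so that alert text cannot
        # trigger @everyone or role mentions by accident.
        "allowed_mentions": {"users": [user_id]},
    }
    return json.dumps(payload)


# Placeholder ID for illustration only.
print(build_discord_payload("**Node k8s-worker-5 is DOWN**", "123456789012345678"))
```

Limiting `allowed_mentions` to a single user ID is a defensive choice: label or annotation values that happen to contain `@everyone` will render as plain text instead of notifying the whole server.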
Alert Rules¶
Infrastructure Alerts¶
groups:
  - name: infrastructure
    rules:
      # Node down - Critical
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is DOWN"
          description: "Node {{ $labels.instance }} has been unreachable for more than 2 minutes."
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ printf \"%.2f\" $value }}% on {{ $labels.instance }}."
      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ printf \"%.2f\" $value }}% on {{ $labels.instance }}."
      # Disk space low
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk usage is {{ printf \"%.2f\" $value }}% on {{ $labels.mountpoint }}."
      # Disk space critical
      - alert: DiskSpaceCritical
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL: Disk almost full on {{ $labels.instance }}"
          description: "Disk usage is {{ printf \"%.2f\" $value }}% on {{ $labels.mountpoint }}. Immediate action required!"
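The two disk rules share one formula and differ only in threshold, so both fire once usage passes 95%. The sketch below reproduces the arithmetic outside PromQL; the function names are hypothetical and the `for:` hold durations are ignored.

```python
from typing import List

GIB = 1024 ** 3


def disk_usage_percent(avail_bytes: float, size_bytes: float) -> float:
    """Same arithmetic as the rule expressions: (1 - avail / size) * 100."""
    return (1 - avail_bytes / size_bytes) * 100


def firing_disk_alerts(usage_percent: float) -> List[str]:
    """List the disk rules whose threshold is exceeded (ignores `for:`)."""
    alerts = []
    if usage_percent > 85:
        alerts.append("DiskSpaceLow")
    if usage_percent > 95:
        alerts.append("DiskSpaceCritical")
    return alerts


# A 500 GiB filesystem with 40 GiB available is roughly 92% used.
usage = disk_usage_percent(40 * GIB, 500 * GIB)
print(round(usage, 2), firing_disk_alerts(usage))
```

Because the thresholds overlap, a nearly full disk produces a warning and a critical notification for the same mountpoint.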
Kubernetes Alerts¶
  - name: kubernetes
    rules:
      # Pod crash looping
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod has restarted about {{ printf \"%.0f\" $value }} times in the last hour."
      # Pod not ready
      - alert: PodNotReady
        expr: kube_pod_status_phase{phase=~"Pending|Unknown"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
          description: "Pod has been in {{ $labels.phase }} state for more than 15 minutes."
      # Deployment replicas mismatch
      - alert: DeploymentReplicasMismatch
        expr: kube_deployment_status_replicas_available != kube_deployment_spec_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has mismatched replicas"
          # $value is the left-hand side of the comparison, i.e. the available replicas.
          description: "{{ $value }} replicas are available, which does not match the desired replica count."
      # StatefulSet replicas mismatch
      - alert: StatefulSetReplicasMismatch
        expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} not ready"
      # PVC pending
      - alert: PersistentVolumeClaimPending
        expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending"
Nextcloud Alerts¶
  - name: nextcloud
    rules:
      # Nextcloud down
      - alert: NextcloudDown
        expr: up{job="nextcloud"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nextcloud is DOWN"
          description: "Nextcloud metrics endpoint is not responding."
      # Nextcloud storage low (threshold is 100 GiB in bytes)
      - alert: NextcloudStorageLow
        expr: nextcloud_free_space_bytes < 107374182400
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Nextcloud storage below 100 GiB"
          description: "Free space: {{ $value | humanize1024 }}B"
      # Nextcloud storage critical (threshold is 20 GiB in bytes)
      - alert: NextcloudStorageCritical
        expr: nextcloud_free_space_bytes < 21474836480
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL: Nextcloud storage below 20 GiB"
          description: "Free space: {{ $value | humanize1024 }}B. Immediate action required!"
      # No active users (possible issue)
      - alert: NextcloudNoActiveUsers
        expr: nextcloud_active_users_total == 0
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "No active Nextcloud users in the last hour"
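The byte thresholds in the storage rules are binary multiples, which is easy to miss when reading the raw numbers. A small sketch of the conversion follows; the constant and function names are illustrative only.

```python
GIB = 1024 ** 3  # bytes in one gibibyte

STORAGE_LOW_BYTES = 100 * GIB       # threshold used by NextcloudStorageLow
STORAGE_CRITICAL_BYTES = 20 * GIB   # threshold used by NextcloudStorageCritical


def bytes_to_gib(value: int) -> float:
    """Convert a byte count to GiB for easier reading."""
    return value / GIB


print(STORAGE_LOW_BYTES, bytes_to_gib(STORAGE_LOW_BYTES))
print(STORAGE_CRITICAL_BYTES, bytes_to_gib(STORAGE_CRITICAL_BYTES))
```

Deriving the numbers this way when adjusting a threshold avoids mixing decimal GB with binary GiB, which differ by about 7% at this scale.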
Application Alerts¶
  - name: applications
    rules:
      # High error rate (expression yields a ratio between 0 and 1)
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
      # Slow responses
      - alert: SlowResponses
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow response times detected"
          description: "95th percentile latency is {{ printf \"%.2f\" $value }}s"
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is DOWN"
          description: "{{ $labels.instance }} is not responding to scrapes."
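The HighErrorRate expression divides one rate by another, so its value is a ratio between 0 and 1 that is compared against 0.05 (5%). The sketch below walks through that arithmetic with made-up request rates; the function names are hypothetical.

```python
from typing import Optional

ERROR_RATIO_THRESHOLD = 0.05  # same threshold as the HighErrorRate rule


def error_ratio(errors_per_second: float, requests_per_second: float) -> Optional[float]:
    """Ratio of 5xx responses to all responses, or None without traffic."""
    if requests_per_second == 0:
        # PromQL would produce NaN here, which never satisfies `> 0.05`.
        return None
    return errors_per_second / requests_per_second


def exceeds_threshold(ratio: Optional[float]) -> bool:
    return ratio is not None and ratio > ERROR_RATIO_THRESHOLD


def as_percent(ratio: float) -> str:
    """Format a ratio as a percentage for display in a notification."""
    return f"{ratio:.2%}"


ratio = error_ratio(3.0, 40.0)  # 3 errors/s out of 40 requests/s
print(ratio, as_percent(ratio), exceeds_threshold(ratio))
```

The distinction matters for the annotation text: printing the raw value next to a percent sign would show a 7.5% error rate as 0.08%.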
Longhorn Storage Alerts¶
  - name: longhorn
    rules:
      # Volume degraded
      - alert: LonghornVolumeDegraded
        expr: longhorn_volume_robustness == 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Longhorn volume {{ $labels.volume }} is degraded"
      # Volume faulted
      - alert: LonghornVolumeFaulted
        expr: longhorn_volume_robustness == 3
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL: Longhorn volume {{ $labels.volume }} is faulted"
      # Node storage low
      - alert: LonghornNodeStorageLow
        expr: (longhorn_node_storage_capacity_bytes - longhorn_node_storage_usage_bytes) / longhorn_node_storage_capacity_bytes < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Longhorn node {{ $labels.node }} storage below 15%"
Alert Severities¶
| Severity | Response Time | Examples | Discord Behavior |
|---|---|---|---|
| Critical | Immediate | Node down, service outage, disk full | @mention, repeat every 1h |
| Warning | Within hours | High resource usage, degraded | @mention, repeat every 4h |
| Info | Next business day | Maintenance, non-urgent | No mention, repeat every 12h |
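The table can also be read as a lookup from severity to notification behaviour. The sketch below encodes it that way; `SeverityPolicy` and `policy_for` are hypothetical names, and the values are copied from the table, not from the Alertmanager config.

```python
from datetime import timedelta
from typing import NamedTuple


class SeverityPolicy(NamedTuple):
    mention: bool               # whether the Discord message pings a user
    repeat_interval: timedelta  # how often an unresolved alert is re-sent


# Values taken from the "Discord Behavior" column of the table above.
POLICIES = {
    "critical": SeverityPolicy(mention=True, repeat_interval=timedelta(hours=1)),
    "warning": SeverityPolicy(mention=True, repeat_interval=timedelta(hours=4)),
    "info": SeverityPolicy(mention=False, repeat_interval=timedelta(hours=12)),
}


def policy_for(severity: str) -> SeverityPolicy:
    """Look up the notification policy for a severity label value."""
    return POLICIES[severity.lower()]


print(policy_for("critical"))
print(policy_for("info"))
```

A table like this is a convenient checklist when reviewing the routing config, since each row should correspond to a route with the matching `repeat_interval`.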
Managing Alerts¶
Silencing Alerts¶
Create silences during maintenance windows:
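# Via kubectl (amtool needs the Alertmanager URL unless an amtool config file exists in the pod)
kubectl -n monitoring exec -it prometheus-alertmanager-0 -- \
  amtool --alertmanager.url=http://localhost:9093 silence add \
  alertname="NodeDown" instance="k8s-worker-5" \
  --duration=2h --comment="Scheduled maintenance"
# List active silences
kubectl -n monitoring exec -it prometheus-alertmanager-0 -- \
  amtool --alertmanager.url=http://localhost:9093 silence query
# Expire a silence
kubectl -n monitoring exec -it prometheus-alertmanager-0 -- \
  amtool --alertmanager.url=http://localhost:9093 silence expire <silence-id>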
Testing Alerts¶
# Send test alert to Alertmanager (v2 API; the v1 API has been removed from current releases)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing Discord webhook integration"
    }
  }]'
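The same test can be scripted, which makes it easier to vary the severity and exercise each route. This is a minimal sketch using only the standard library. It assumes Alertmanager is reachable on `localhost:9093` (for example through a port-forward) and posts to the v2 API path, and `build_test_alert` and `post_alerts` are hypothetical helper names.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(severity: str = "warning", minutes: int = 5) -> list:
    """Build a one-element alert list in the shape Alertmanager accepts."""
    ends_at = datetime.now(timezone.utc) + timedelta(minutes=minutes)
    return [{
        "labels": {"alertname": "TestAlert", "severity": severity},
        "annotations": {
            "summary": "This is a test alert",
            "description": "Testing Discord webhook integration",
        },
        # Let the test alert resolve on its own after a few minutes.
        "endsAt": ends_at.isoformat(),
    }]


def post_alerts(alerts: list, base_url: str = "http://localhost:9093") -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Example (requires a reachable Alertmanager):
#   post_alerts(build_test_alert(severity="critical"))
```

Setting `endsAt` keeps test alerts from lingering until `resolve_timeout` expires, and with `send_resolved: true` it also exercises the resolved-notification path.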
Kibana Alerts¶
Log-Based Alerts¶
Kibana can also send alerts based on log patterns:
- Go to Stack Management → Rules
- Create Elasticsearch query rule
- Define query conditions (e.g., error count > threshold)
- Set check interval
- Configure Discord webhook action
Example: Error Spike Alert¶
{
  "rule_type": "es_query",
  "query": {
    "bool": {
      "must": [
        {"match": {"log.level": "error"}},
        {"range": {"@timestamp": {"gte": "now-5m"}}}
      ]
    }
  },
  "threshold": 50,
  "time_window": "5m",
  "actions": [{
    "type": "webhook",
    "url": "<DISCORD_WEBHOOK_URL>"
  }]
}
Best Practices¶
- Avoid alert fatigue - Only alert on actionable items
- Use severity levels - Prioritize response appropriately
- Include runbook links - Provide resolution steps in annotations
- Set appropriate thresholds - Tune to avoid false positives
- Group related alerts - Reduce notification noise
- Test alerts regularly - Verify they fire correctly
- Document escalation - Know who to contact for what
- Review and prune - Remove stale or noisy alerts
Troubleshooting¶
Check Alertmanager Status¶
kubectl -n monitoring get pods -l app.kubernetes.io/name=alertmanager
kubectl -n monitoring logs -l app.kubernetes.io/name=alertmanager
Verify Webhook Connectivity¶
kubectl -n monitoring exec -it prometheus-alertmanager-0 -- \
  wget -q --spider https://discord.com
View Active Alerts¶
Access Alertmanager UI at http://alertmanager.monitoring.svc:9093 or via port-forward: