Prometheus
Overview
Prometheus is the core metrics collection and alerting system, scraping metrics from all instrumented applications and infrastructure components.
Architecture
graph LR
    subgraph Targets
        N[Node Exporter]
        K[kube-state-metrics]
        A[Applications]
    end
    subgraph Prometheus
        S[Scraper]
        TSDB[(TSDB)]
        R[Rule Engine]
    end
    subgraph Outputs
        G[Grafana]
        AM[Alertmanager]
    end
    N --> S
    K --> S
    A --> S
    S --> TSDB
    TSDB --> G
    R --> AM
Configuration
Scrape Config
global:
  scrape_interval: 30s
  evaluation_interval: 30s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
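To make the effect of the keep rule concrete, here is a minimal Python sketch of how it filters discovered pods. It is an illustration, not Prometheus code: the function name keep_targets and the sample pods are hypothetical. It relies on the documented behavior that relabel regexes are anchored at both ends, so true matches only the exact annotation value.

```python
import re

SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"


def keep_targets(targets, source_label=SCRAPE_LABEL, regex="true"):
    """Return only the targets whose source label value fully matches regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # A label that is absent is treated as an empty string.
        value = labels.get(source_label, "")
        if pattern.fullmatch(value):
            kept.append(labels)
    return kept


# Hypothetical discovered pods, for illustration only.
pods = [
    {"__meta_kubernetes_pod_name": "api-0", SCRAPE_LABEL: "true"},
    {"__meta_kubernetes_pod_name": "db-0", SCRAPE_LABEL: "false"},
    {"__meta_kubernetes_pod_name": "cron-0"},  # no annotation at all
]
kept = keep_targets(pods)
print([p["__meta_kubernetes_pod_name"] for p in kept])
```

Pods without the annotation are dropped along with pods annotated false, which is a common reason a new workload never appears as a target.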
ServiceMonitor
For applications using ServiceMonitor CRDs:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-namespace
spec:
  # Selects Services labeled app=my-app
  selector:
    matchLabels:
      app: my-app
  endpoints:
    # Name of the Service port that exposes metrics
    - port: metrics
      interval: 30s
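The selection logic in this manifest can be sketched in a few lines of Python. This is a simplified model rather than the Prometheus Operator's implementation: the function names and the sample Service data are hypothetical, and only matchLabels and a named port are considered.

```python
def selector_matches(match_labels, service_labels):
    """True when every key/value pair in match_labels is present on the Service."""
    return all(service_labels.get(k) == v for k, v in match_labels.items())


def endpoint_resolves(port_name, service_port_names):
    """True when the ServiceMonitor endpoint names a port the Service defines."""
    return port_name in service_port_names


# Values taken from the manifest above.
match_labels = {"app": "my-app"}
endpoint_port = "metrics"

# Hypothetical Services, for illustration only.
good_service = {"labels": {"app": "my-app", "tier": "backend"}, "ports": ["http", "metrics"]}
bad_service = {"labels": {"app": "my-app-v2"}, "ports": ["http"]}

for svc in (good_service, bad_service):
    selected = selector_matches(match_labels, svc["labels"])
    scrapeable = selected and endpoint_resolves(endpoint_port, svc["ports"])
    print(svc["labels"]["app"], selected, scrapeable)
```

Extra labels on the Service do not prevent a match; a missing or differently valued label does.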
Key Metrics
Kubernetes Metrics
| Metric | Description |
|---|---|
| kube_pod_status_phase | Pod phase status |
| kube_deployment_status_replicas | Deployment replica count |
| kube_node_status_condition | Node condition status |
Node Metrics
| Metric | Description |
|---|---|
| node_cpu_seconds_total | Cumulative CPU time in seconds, per CPU and mode |
| node_memory_MemAvailable_bytes | Available memory in bytes |
| node_filesystem_avail_bytes | Available disk space in bytes |
Container Metrics
| Metric | Description |
|---|---|
| container_cpu_usage_seconds_total | Cumulative container CPU time in seconds |
| container_memory_working_set_bytes | Container working set memory in bytes |
| container_network_receive_bytes_total | Cumulative network bytes received (RX) |
Alerting Rules
Critical Alerts
groups:
  - name: critical
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} down"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 15m
        labels:
          severity: warning
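The for clause is easy to misread, so the following Python sketch models how an alert moves from pending to firing. It is a simplified model under stated assumptions, not Prometheus source code: the function name alert_states is hypothetical, and evaluations are assumed to arrive every 30 seconds to match the evaluation_interval configured earlier.

```python
def alert_states(samples, hold_seconds):
    """Model the `for` clause of an alerting rule.

    samples: list of (timestamp_seconds, condition_is_true) in evaluation order.
    Returns one state per evaluation: 'inactive', 'pending', or 'firing'.
    """
    states = []
    active_at = None  # time of the first evaluation in the current true streak
    for ts, condition in samples:
        if not condition:
            active_at = None  # any false evaluation resets the streak
            states.append("inactive")
            continue
        if active_at is None:
            active_at = ts
        held = ts - active_at
        states.append("firing" if held >= hold_seconds else "pending")
    return states


# NodeDown uses `for: 5m`; the expression is true for twelve evaluations in a row.
timeline = [(t, True) for t in range(0, 360, 30)]
states = alert_states(timeline, hold_seconds=300)
print(states.index("firing"), "evaluations were pending before firing")
```

A single evaluation where the expression is false resets the timer, so a flapping target may stay pending and never fire.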
Warning Alerts
- alert: HighMemoryUsage
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
Storage
Retention
- Default: 15 days
- Storage path: /prometheus/data
- Block duration: 2 hours
Capacity Planning
| Active series | Samples/sec (30s scrape interval) | Storage/day |
|---|---|---|
| 10,000 | ~333 | ~500 MB |
| 50,000 | ~1,667 | ~2.5 GB |
| 100,000 | ~3,333 | ~5 GB |
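The arithmetic behind a table like this can be sketched in Python. The calculation follows the sizing formula from the Prometheus storage documentation (retention time in seconds × ingested samples per second × bytes per sample). The function names are hypothetical, and bytes_per_sample is an assumption you should measure in your own deployment rather than a fixed constant.

```python
SECONDS_PER_DAY = 86_400


def samples_per_second(active_series, scrape_interval_s=30):
    """Each active series yields one sample per scrape interval."""
    return active_series / scrape_interval_s


def storage_bytes(active_series, retention_days, bytes_per_sample, scrape_interval_s=30):
    """Estimated disk usage: retention seconds * samples/sec * bytes/sample."""
    rate = samples_per_second(active_series, scrape_interval_s)
    return retention_days * SECONDS_PER_DAY * rate * bytes_per_sample


# bytes_per_sample=2 is an assumed value, used only for illustration.
for series in (10_000, 50_000, 100_000):
    per_day = storage_bytes(series, retention_days=1, bytes_per_sample=2)
    print(f"{series:>7} series: {samples_per_second(series):8.0f} samples/s, "
          f"{per_day / 1e6:8.1f} MB/day")
```

Working backwards, the table's ~500 MB/day for 10,000 series implies roughly 17 bytes per sample, which is a conservative figure. The Prometheus storage documentation cites 1 to 2 bytes per sample on average, so the actual footprint may be considerably smaller.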
Queries (PromQL)
Common Queries
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace="hub"}[5m])) by (pod)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# HTTP request rate
sum(rate(http_requests_total[5m])) by (handler, status)
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
API Endpoints
| Endpoint | Description |
|---|---|
| /api/v1/query | Instant query |
| /api/v1/query_range | Range query |
| /api/v1/targets | Scrape targets |
| /api/v1/alerts | Active alerts |
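As a rough illustration of calling the instant query endpoint, the Python sketch below builds a request URL and parses a response body. The helper names and the base URL http://prometheus:9090 are assumptions for the example. The response is a hand-written sample in the documented JSON shape, so the code runs without a live server.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql, endpoint="/api/v1/query"):
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url.rstrip('/')}{endpoint}?{urlencode({'query': promql})}"


def parse_instant_vector(body):
    """Parse an instant-query response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each result carries value = [unix_timestamp, "sample value as a string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Hypothetical server address; replace with your own Prometheus URL.
url = build_query_url("http://prometheus:9090/", 'up{job="node"}')
print(url)

# Hand-written sample response in the documented format.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "node-1:9100", "job": "node"},
             "value": [1700000000.0, "0"]},
        ],
    },
})
for labels, value in parse_instant_vector(sample_body):
    print(labels["instance"], value)
```

To query a real server, the URL can be fetched with urllib.request.urlopen or any HTTP client and the body passed to parse_instant_vector.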
Troubleshooting
Target Not Scraped
- Check that the ServiceMonitor selector labels match the labels on the target Service
- Verify that the metrics endpoint is accessible from Prometheus
- Check the Prometheus logs for errors
High Cardinality
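- Review label usage, especially labels with unbounded values such as user IDs or request paths
- Use recording rules to pre-aggregate frequently queried expressions
- Reduce scrape targets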
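When reviewing label usage, it helps to see which label contributes the most distinct values. The Python sketch below shows one way to rank labels, assuming you already have a list of series label sets (for example, exported from a query). The function name label_cardinality and the sample series are hypothetical.

```python
from collections import defaultdict


def label_cardinality(series):
    """Rank label names by their number of distinct values, highest first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    # Sort by descending count, then by name for a stable order.
    return sorted(((name, len(vals)) for name, vals in values.items()),
                  key=lambda item: (-item[1], item[0]))


# Hypothetical series for one metric, for illustration only.
series = [
    {"handler": "/login", "status": "200", "user_id": "1001"},
    {"handler": "/login", "status": "200", "user_id": "1002"},
    {"handler": "/login", "status": "500", "user_id": "1003"},
    {"handler": "/search", "status": "200", "user_id": "1004"},
]
for name, count in label_cardinality(series):
    print(f"{name}: {count} distinct values")
```

In this sample the user_id label has one value per series, which is the typical signature of a label that should be dropped or aggregated away.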