Prometheus

Overview

Prometheus is the core metrics collection and alerting system, scraping metrics from all instrumented applications and infrastructure components.

Architecture

graph LR
    subgraph Targets
        N[Node Exporter]
        K[kube-state-metrics]
        A[Applications]
    end

    subgraph Prometheus
        S[Scraper]
        TSDB[(TSDB)]
        R[Rule Engine]
    end

    subgraph Outputs
        G[Grafana]
        AM[Alertmanager]
    end

    N --> S
    K --> S
    A --> S
    S --> TSDB
    TSDB --> R
    TSDB --> G
    R --> AM

Configuration

Scrape Config

global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
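
The keep rule above can be hard to reason about when a pod is unexpectedly dropped. The following is a minimal Python sketch of how such a rule decides, not Prometheus's own code: it assumes the default ";" separator, treats a missing label as an empty string, and uses Python's re module in place of the RE2 engine Prometheus uses. The keep_target helper and the pod names are hypothetical.

```python
import re

SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"

def keep_target(labels, source_labels=(SCRAPE_LABEL,), regex="true", separator=";"):
    """Return True if a `keep` relabel rule would retain this target."""
    # Missing labels count as empty strings; multiple values are joined.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    return re.fullmatch(regex, value) is not None

# Hypothetical pods and their discovered labels.
pods = {
    "annotated": {SCRAPE_LABEL: "true"},
    "opted-out": {SCRAPE_LABEL: "false"},
    "unannotated": {},
}
kept = [name for name, labels in pods.items() if keep_target(labels)]
print(kept)  # ['annotated']
```

Only the pod carrying the annotation with the exact value "true" survives; pods without the annotation are dropped, which is a common cause of a missing target.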

ServiceMonitor

For clusters running the Prometheus Operator, declare a ServiceMonitor to have an application's Service scraped:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app: my-app        # must match the labels on the target Service
  endpoints:
  - port: metrics        # name of the Service port that exposes metrics
    interval: 30s
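
A ServiceMonitor selects a Service only when every matchLabels pair is present on that Service. As a rough illustration of that rule (a sketch of the matching logic, not the operator's implementation), the hypothetical selector_matches helper below compares a selector against a Service's labels; the tier label is made up for the example.

```python
def selector_matches(match_labels, service_labels):
    """True when every matchLabels pair is present on the Service."""
    return all(service_labels.get(k) == v for k, v in match_labels.items())

match_labels = {"app": "my-app"}

# Extra labels on the Service do not prevent a match.
print(selector_matches(match_labels, {"app": "my-app", "tier": "backend"}))  # True
# A different value for a selected key does.
print(selector_matches(match_labels, {"app": "other-app"}))                  # False
```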

Key Metrics

Kubernetes Metrics

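Metric                           Description
kube_pod_status_phase            Current phase of each pod (Pending, Running, Succeeded, Failed, Unknown)
kube_deployment_status_replicas  Number of replicas observed for each Deployment
kube_node_status_condition       Status of each node condition (for example Ready, MemoryPressure)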

Node Metrics

Metric                          Description
node_cpu_seconds_total          Cumulative CPU time in seconds, per CPU and mode
node_memory_MemAvailable_bytes  Available memory in bytes
node_filesystem_avail_bytes     Filesystem space available to non-root users, in bytes

Container Metrics

Metric                                 Description
container_cpu_usage_seconds_total      Cumulative container CPU time in seconds
container_memory_working_set_bytes     Container working-set memory in bytes
container_network_receive_bytes_total  Cumulative bytes received over the network

Alerting Rules

Critical Alerts

groups:
- name: critical
  rules:
  - alert: NodeDown
    expr: up{job="node"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} down"

  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

Warning Alerts

groups:
- name: warning
  rules:
  - alert: HighMemoryUsage
    # Fires when less than 10% of memory is available
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
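
The for clause in these rules delays firing until the expression has been true continuously for the given duration. The sketch below is a simplified Python model of that behaviour, assuming the 30s evaluation_interval from the global config; alert_state is a hypothetical helper, not part of Prometheus.

```python
def alert_state(results, for_seconds, eval_interval=30):
    """Return 'inactive', 'pending', or 'firing' after a run of evaluations.

    results holds one boolean per rule evaluation, oldest first.
    """
    active_for = None  # seconds since the condition first became true
    for is_true in results:
        if not is_true:
            active_for = None          # any false evaluation resets the timer
        elif active_for is None:
            active_for = 0             # first true evaluation: alert is pending
        else:
            active_for += eval_interval
    if active_for is None:
        return "inactive"
    return "firing" if active_for >= for_seconds else "pending"

FOR_10M = 10 * 60
print(alert_state([True] * 20, FOR_10M))            # pending
print(alert_state([True] * 21, FOR_10M))            # firing
print(alert_state([True] * 21 + [False], FOR_10M))  # inactive
```

Under these assumptions, HighMemoryUsage needs 21 consecutive true evaluations (10 minutes after the first) before it fires, and a single false evaluation resets it.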

Storage

Retention

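  • Default: 15 days (set with --storage.tsdb.retention.time)
  • Storage path: /prometheus/data (set with --storage.tsdb.path)
  • Block duration: 2 hours (the initial TSDB block size, before compaction)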

Capacity Planning

Samples/sec assumes the 30s scrape_interval configured above (active series / 30).

Active series  Samples/sec  Storage/day
10,000         ~333         ~500MB
50,000         ~1,666       ~2.5GB
100,000        ~3,333       ~5GB
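
To size a volume for the full retention window, the table's figures can be extrapolated. The following Python sketch takes the table's ~500MB per day per 10,000 series as given, and assumes linear scaling, the 30s scrape interval, and the 15-day default retention; the function names are illustrative only.

```python
SCRAPE_INTERVAL_S = 30      # scrape_interval from the global config
RETENTION_DAYS = 15         # default retention
MB_PER_DAY_PER_10K = 500    # taken from the first row of the table

def samples_per_second(active_series, interval_s=SCRAPE_INTERVAL_S):
    """Each series yields one sample per scrape interval."""
    return active_series / interval_s

def retention_storage_gb(active_series, days=RETENTION_DAYS):
    """Estimated disk usage over the retention window, in GB (1 GB = 1000 MB)."""
    mb_per_day = active_series / 10_000 * MB_PER_DAY_PER_10K
    return mb_per_day * days / 1000

for series in (10_000, 50_000, 100_000):
    print(series, int(samples_per_second(series)), retention_storage_gb(series))
```

With these inputs the three rows come to roughly 7.5 GB, 37.5 GB, and 75 GB over 15 days. Treat the result as a starting point and leave headroom, since actual usage depends on churn and compression.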

Queries (PromQL)

Common Queries

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace="hub"}[5m])) by (pod)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# HTTP request rate
sum(rate(http_requests_total[5m])) by (handler, status)

# 95th percentile latency (aggregate buckets by le before computing the quantile)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

API Endpoints

Endpoint             Description
/api/v1/query        Evaluate an instant query at a single point in time
/api/v1/query_range  Evaluate a query over a time range
/api/v1/targets      Current state of scrape targets
/api/v1/alerts       Active (pending and firing) alerts
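
As an illustration of calling these endpoints from a script, here is a minimal Python sketch that runs an instant query using only the standard library. The base URL http://prometheus:9090 is an assumption about the deployment, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def instant_query_url(base_url, promql):
    """Build the /api/v1/query URL with the PromQL expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

def parse_vector(payload):
    """Turn an instant-query response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]

def run_query(base_url, promql, timeout=10):
    """Execute the query against a live server (requires network access)."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_vector(json.load(resp))

# Assumed address; adjust to wherever Prometheus is reachable.
url = instant_query_url("http://prometheus:9090", 'up{job="node"}')
print(url)
```

Calling run_query("http://prometheus:9090", 'up{job="node"}') would return one (labels, value) pair per node target, with a value of 0.0 for targets that are down.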

Troubleshooting

Target Not Scraped

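  1. Check that the ServiceMonitor's selector.matchLabels match the labels on the target Service
  2. Verify that the metrics endpoint is reachable from the Prometheus pod
  3. Check the Prometheus logs and /api/v1/targets for scrape errors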
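
The checks above can be partly automated by reading the /api/v1/targets response. The Python sketch below filters that response for targets that are not healthy; the sample payload is invented for illustration, and unhealthy_targets is a hypothetical helper.

```python
def unhealthy_targets(payload):
    """List active targets whose health is not 'up', with their last error."""
    return [
        {
            "job": t["labels"].get("job", ""),
            "url": t["scrapeUrl"],
            "error": t["lastError"],
        }
        for t in payload["data"]["activeTargets"]
        if t["health"] != "up"
    ]

# Invented sample shaped like a /api/v1/targets response.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "my-app"}, "scrapeUrl": "http://10.0.0.5:8080/metrics",
             "health": "down", "lastError": "connection refused"},
            {"labels": {"job": "node"}, "scrapeUrl": "http://10.0.0.6:9100/metrics",
             "health": "up", "lastError": ""},
        ]
    },
}
for target in unhealthy_targets(sample):
    print(target["job"], target["url"], target["error"])
```

If the application does not appear in activeTargets at all, the problem is more likely in discovery (step 1) than in the scrape itself.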

High Cardinality

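  1. Review label usage and avoid labels with unbounded values, such as user or request IDs
  2. Use recording rules to pre-aggregate high-cardinality series for dashboards and alerts
  3. Reduce scrape targets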
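
A practical starting point for the label review is finding which metric names own the most series. The Python sketch below ranks the result of a series-count query; the query can be expensive on a large server, the sample counts are invented, and rank_by_series is a hypothetical helper.

```python
# PromQL that counts series per metric name and keeps the ten largest.
TOP_SERIES_QUERY = 'topk(10, count by (__name__) ({__name__=~".+"}))'

def rank_by_series(payload):
    """Return (metric name, series count) pairs, largest first."""
    rows = [
        (s["metric"].get("__name__", ""), int(float(s["value"][1])))
        for s in payload["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Invented sample shaped like an instant-query response for TOP_SERIES_QUERY.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "http_requests_total"}, "value": [1700000000, "1200"]},
            {"metric": {"__name__": "http_request_duration_seconds_bucket"}, "value": [1700000000, "9600"]},
        ],
    },
}
for name, count in rank_by_series(sample):
    print(f"{count:>8}  {name}")
```

Metrics near the top of this list are the first candidates for dropping labels or pre-aggregating with recording rules.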