Monitoring Stack Overview

Architecture

The monitoring stack covers three areas: metrics (Prometheus and Grafana), logs (Elastic Agent, Elasticsearch, and Kibana), and alerting (Alertmanager).

graph TB
    subgraph Data Collection
        PA[Prometheus]
        EA[Elastic Agent]
    end

    subgraph Storage
        TSDB[(Prometheus TSDB)]
        ES[(Elasticsearch)]
    end

    subgraph Visualization
        Grafana[Grafana]
        Kibana[Kibana]
    end

    subgraph Alerting
        AM[Alertmanager]
    end

    Apps[Applications] -->|metrics| PA
    Apps -->|logs| EA
    PA --> TSDB
    EA --> ES
    TSDB --> Grafana
    ES --> Kibana
    PA --> AM

Components

Component       Purpose                 Port
Prometheus      Metrics collection      9090
Grafana         Metrics visualization   3000
Alertmanager    Alert routing           9093
Elasticsearch   Log storage             9200
Kibana          Log analysis            5601
Elastic Agent   Log collection          8220

Metrics Pipeline

Collection

  1. Applications expose /metrics endpoint
  2. Prometheus scrapes targets at configured intervals
  3. Metrics stored in time-series database
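
As an illustration of step 1, the sketch below shows roughly what an application serves on /metrics. It uses only the Python standard library and renders the Prometheus text exposition format by hand. The metric names, the values, and port 8000 are assumptions made for this example; a real service would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metrics: name -> (type, help text, current value).
METRICS = {
    "app_requests_total": ("counter", "Total requests handled.", 0),
    "app_queue_depth": ("gauge", "Jobs currently waiting in the queue.", 0),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (kind, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given (assumed) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; Prometheus would then need a scrape target pointing at that host and port.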

Visualization

  1. Grafana connects to Prometheus datasource
  2. Dashboards query and display metrics
  3. Variables enable dynamic filtering
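
To make the query and variable steps concrete, here is a rough sketch of how a dashboard panel's query could be expanded and turned into a request against the Prometheus HTTP API (/api/v1/query_range). The https scheme, the metric name http_requests_total, and the namespace variable are assumptions for illustration; this is not Grafana's own implementation.

```python
from string import Template
from urllib.parse import urlencode


def expand_variables(promql_template, variables):
    """Substitute $name placeholders, similar in spirit to dashboard variables."""
    return Template(promql_template).substitute(variables)


def build_range_query_url(base_url, promql, start, end, step="30s"):
    """Build a query_range URL; start and end are Unix timestamps in seconds."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical panel query filtered by a "namespace" variable.
PANEL_QUERY = 'sum(rate(http_requests_total{namespace="$namespace"}[5m]))'
```

For example, expand_variables(PANEL_QUERY, {"namespace": "prod"}) produces the concrete PromQL string, which build_range_query_url then URL-encodes for the chosen time range.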

Alerting

  1. Prometheus evaluates alert rules
  2. Firing alerts sent to Alertmanager
  3. Alertmanager routes to notification channels
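
The routing step can be pictured with a deliberately simplified model: each route lists label values to match, and the first matching route decides the receiver. The receiver names and labels below are hypothetical, and real Alertmanager routing also supports nested route trees, regex matchers, grouping, and inhibition, none of which are modelled here.

```python
def route_alert(labels, routes, default_receiver):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return default_receiver


# Hypothetical routes, evaluated top to bottom.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pager"},
    {"match": {"team": "database"}, "receiver": "db-chat"},
]
```

An alert labelled severity=critical would go to the "pager" receiver, while an alert matching no route falls back to the default receiver.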

Logging Pipeline

Collection

  1. Elastic Agent deployed as DaemonSet
  2. Collects container logs, system logs
  3. Ships to Elasticsearch

Processing

  1. Ingest pipelines parse log formats
  2. Enrichment adds metadata
  3. Data indexed by date
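
The three processing steps can be sketched in Python as parse, enrich, and pick a date-based index. In the actual stack this work happens in Elasticsearch ingest pipelines; the log line layout, the metadata field name, and the logs-YYYY.MM.DD index naming below are assumptions chosen for the example.

```python
import re
from datetime import datetime, timezone

# Assumed line layout: "<ISO-8601 timestamp> <LEVEL> <message>".
LOG_PATTERN = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def process_line(line, metadata, index_prefix="logs"):
    """Parse one log line, add metadata, and choose a daily index name.

    Returns (index_name, document), or None if the line does not match.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    doc.update(metadata)  # enrichment step
    # Normalise a trailing "Z" so fromisoformat accepts it on older Pythons.
    ts = datetime.fromisoformat(doc["timestamp"].replace("Z", "+00:00"))
    ts = ts.astimezone(timezone.utc)
    index_name = f"{index_prefix}-{ts:%Y.%m.%d}"
    return index_name, doc
```

Lines that do not match the assumed layout return None here; a real pipeline would typically route them to a failure handler instead of dropping them.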

Analysis

  1. Kibana provides search interface
  2. Dashboards aggregate log data
  3. Alerts trigger on log patterns
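
A log-pattern alert boils down to counting matches in a window of messages and comparing the count with a threshold. The sketch below shows that idea only; the pattern and threshold are hypothetical, and it does not reflect how Kibana's alerting rules are configured or evaluated.

```python
import re


def pattern_alert(messages, pattern, threshold):
    """Return (fired, hits): whether at least `threshold` messages match `pattern`."""
    regex = re.compile(pattern)
    hits = sum(1 for message in messages if regex.search(message))
    return hits >= threshold, hits
```

For instance, a rule for repeated login failures might use a pattern such as "authentication failed" with a threshold of a few matches per evaluation window.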

Access URLs

Service      URL
Grafana      grafana.ajandrews.pro
Kibana       kibana.ajandrews.pro
Prometheus   prometheus.ajandrews.pro

Key Metrics

Infrastructure

  • Node CPU/memory usage
  • Pod resource consumption
  • Network I/O
  • Storage capacity

Applications

  • Request rate and latency
  • Error rates
  • Queue depths
  • Connection pools

Databases

  • Query performance
  • Connection counts
  • Replication lag
  • Cache hit rates

Retention Policies

Data Type            Retention
Prometheus metrics   15 days
Elasticsearch logs   30 days
Alertmanager state   120 hours
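
To see what these windows mean in practice, the following sketch computes the oldest timestamp each store would still hold at a given moment. The dictionary keys and function names are made up for the example; the durations come from the table above.

```python
from datetime import datetime, timedelta, timezone

# Retention windows from the table above (keys are illustrative names).
RETENTION = {
    "prometheus_metrics": timedelta(days=15),
    "elasticsearch_logs": timedelta(days=30),
    "alertmanager_state": timedelta(hours=120),  # equal to 5 days
}


def retention_cutoff(data_type, now):
    """Return the oldest timestamp still inside the retention window."""
    return now - RETENTION[data_type]


def is_expired(data_type, timestamp, now):
    """True if data recorded at `timestamp` is older than the retention window."""
    return timestamp < retention_cutoff(data_type, now)
```

With now set to 30 June 2024 (UTC), metrics older than 15 June and logs older than 31 May would fall outside their windows.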

Dashboards

Infrastructure Dashboards

  • Kubernetes Cluster Overview
  • Node Resource Usage
  • Pod Resource Consumption
  • Storage Capacity

Application Dashboards

  • Hub API Performance
  • Nginx Request Metrics
  • Database Connections

Security Dashboards

  • Threat Intelligence
  • Network Security
  • Failed Authentication