Monitoring Stack Overview
Architecture
The monitoring stack covers three areas: metrics (Prometheus and Grafana), logs (Elastic Agent, Elasticsearch, and Kibana), and alerting (Alertmanager).
    graph TB
        subgraph Data Collection
            PA[Prometheus]
            EA[Elastic Agent]
        end
        subgraph Storage
            TSDB[(Prometheus TSDB)]
            ES[(Elasticsearch)]
        end
        subgraph Visualization
            Grafana[Grafana]
            Kibana[Kibana]
        end
        subgraph Alerting
            AM[Alertmanager]
        end
        Apps[Applications] -->|metrics| PA
        Apps -->|logs| EA
        PA --> TSDB
        EA --> ES
        TSDB --> Grafana
        ES --> Kibana
        PA --> AM
Components
| Component | Purpose | Port |
|---|---|---|
| Prometheus | Metrics collection | 9090 |
| Grafana | Metrics visualization | 3000 |
| Alertmanager | Alert routing | 9093 |
| Elasticsearch | Log storage | 9200 |
| Kibana | Log analysis | 5601 |
| Elastic Agent | Log collection | 8220 |
Metrics Pipeline
Collection
- Applications expose a /metrics endpoint
- Prometheus scrapes targets at configured intervals
- Metrics are stored in the Prometheus time-series database (TSDB)
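
To make the collection step concrete, here is a minimal sketch of what an application might return from its /metrics endpoint, written in the Prometheus text exposition format. The metric name, labels, and values are hypothetical, and a real service would normally rely on a Prometheus client library rather than formatting the text by hand.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the current counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        # Label braces are omitted entirely when a sample has no labels.
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter, used only for illustration.
body = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    {
        (("method", "GET"), ("code", "200")): 1027,
        (("method", "POST"), ("code", "500")): 3,
    },
)
print(body)
```

Each scrape reads the current counter values; Prometheus attaches the scrape timestamp and stores one sample per series.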
Visualization
- Grafana connects to Prometheus datasource
- Dashboards query and display metrics
- Variables enable dynamic filtering
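
As a rough illustration of how a dashboard variable turns into a concrete query, the sketch below substitutes a variable into a PromQL expression and builds a request for the Prometheus HTTP API (`/api/v1/query`). The host, the `namespace` variable, and the panel query are assumptions made for the example; Grafana performs its own interpolation internally, and `string.Template` only mimics the `$variable` syntax.

```python
from string import Template
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: Prometheus is reachable on localhost at the port from the Components table.
PROMETHEUS_URL = "http://localhost:9090"

# Hypothetical panel query; $namespace stands in for a dashboard variable.
PANEL_QUERY = Template('sum(rate(http_requests_total{namespace="$namespace"}[5m]))')


def build_query_url(namespace):
    """Fill in the variable and return an instant-query URL for Prometheus."""
    promql = PANEL_QUERY.substitute(namespace=namespace)
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


url = build_query_url("production")
print(url)
# Decode the query string again to show the PromQL that Prometheus would receive.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Changing the variable value changes only the label matcher, so one panel definition can serve every namespace.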
Alerting
- Prometheus evaluates alert rules
- Firing alerts sent to Alertmanager
- Alertmanager routes to notification channels
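
The routing step can be pictured as label matching. The following is a deliberately simplified model, not Alertmanager's actual implementation: real routing trees also support nested routes, regular-expression matchers, grouping, and inhibition. The receiver names and labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)  # labels that must all be equal


def pick_receiver(alert_labels, routes, default_receiver):
    """Return the receiver of the first route whose labels all match the alert."""
    for route in routes:
        if all(alert_labels.get(key) == value for key, value in route.match.items()):
            return route.receiver
    return default_receiver


# Hypothetical routes, evaluated in order.
ROUTES = [
    Route("pagerduty-oncall", {"severity": "critical"}),
    Route("slack-database", {"team": "database"}),
]

alert = {"alertname": "HighErrorRate", "severity": "critical", "team": "api"}
print(pick_receiver(alert, ROUTES, "email-default"))
```

Because routes are checked in order, more specific matchers should come before broader ones, with the default receiver catching everything else.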
Logging Pipeline
Collection
- Elastic Agent deployed as DaemonSet
- Collects container logs and system logs
- Ships to Elasticsearch
Processing
- Ingest pipelines parse log formats
- Enrichment adds metadata
- Data indexed by date
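
The three processing steps can be sketched in a few lines. This is an illustrative model only: the log line format, the metadata fields, and the `logs-YYYY.MM.DD` index naming are assumptions, and in the actual stack this work is done by Elasticsearch ingest pipelines rather than application code.

```python
import re
from datetime import datetime, timezone

# Assumed line format: "<ISO-8601 timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def process(line, metadata):
    """Parse a log line, enrich it, and choose a date-based index name."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # unparseable lines are skipped in this sketch
    doc = match.groupdict()
    doc.update(metadata)  # enrichment: attach source metadata to the document
    # Normalize to UTC so a document lands in the same index regardless of source time zone.
    ts = datetime.fromisoformat(doc["timestamp"]).astimezone(timezone.utc)
    index = f"logs-{ts:%Y.%m.%d}"
    return index, doc


# Hypothetical log line and metadata.
result = process(
    "2024-05-01T23:30:00+00:00 ERROR connection refused",
    {"kubernetes.namespace": "production", "host.name": "node-1"},
)
print(result)
```

Indexing by date keeps each day's documents together, which makes it straightforward to drop whole indices once they pass the retention window.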
Analysis
- Kibana provides search interface
- Dashboards aggregate log data
- Alerts trigger on log patterns
Access URLs
Key Metrics
Infrastructure
- Node CPU/memory usage
- Pod resource consumption
- Network I/O
- Storage capacity
Applications
- Request rate and latency
- Error rates
- Queue depths
- Connection pools
Databases
- Query performance
- Connection counts
- Replication lag
- Cache hit rates
Retention Policies
| Data Type | Retention |
|---|---|
| Prometheus metrics | 15 days |
| Elasticsearch logs | 30 days |
| Alertmanager state | 120 hours |
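
Since the table mixes days and hours, a small sketch can put the windows on a common footing and show the cutoff each one implies. The dictionary keys and the reference time are made up for the example. The cutoff is only an approximation of when data becomes eligible for removal, because each component applies its own cleanup schedule.

```python
from datetime import datetime, timedelta, timezone

# Retention windows from the table above.
RETENTION = {
    "prometheus_metrics": timedelta(days=15),
    "elasticsearch_logs": timedelta(days=30),
    "alertmanager_state": timedelta(hours=120),
}


def cutoff(data_type, now):
    """Return the timestamp before which data falls outside the retention window."""
    return now - RETENTION[data_type]


# Fixed reference time so the example is reproducible.
now = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
for data_type, window in RETENTION.items():
    hours = window.total_seconds() / 3600
    print(f"{data_type}: {hours:.0f} h, cutoff {cutoff(data_type, now).isoformat()}")
```

Expressed in the same unit, the windows are 360, 720, and 120 hours, so Alertmanager state has the shortest retention at 5 days.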
Dashboards
Infrastructure Dashboards
- Kubernetes Cluster Overview
- Node Resource Usage
- Pod Resource Consumption
- Storage Capacity
Application Dashboards
- Hub API Performance
- Nginx Request Metrics
- Database Connections
Security Dashboards
- Threat Intelligence
- Network Security
- Failed Authentication