Monitoring SOP¶

Version: 1.0
Last Updated: 2024-01-01
Owner: DevOps Team

Purpose¶

Define standards for monitoring infrastructure, applications, and alerting to ensure system health and reliability.

Component	Tool	Purpose
Metrics	Prometheus	Time-series data collection
Visualization	Grafana	Dashboards and charts
Logging	Loki / ELK	Centralized log aggregation
Tracing	Jaeger	Distributed request tracing
Alerting	Alertmanager	Alert routing and notification
Uptime	Uptime Robot	External health checks

Metric	Warning	Critical	Description
CPU Usage	> 80% for 5m	> 95% for 2m	Node CPU utilization
Memory	> 85%	> 95%	Node memory utilization
Disk	> 80%	> 90%	Root disk usage
Service status	Unhealthy	Down	Service health check

Metric	Warning	Critical	Description
5xx errors	> 1% for 5m	> 5% for 2m	Server errors
Latency p99	> 500ms	> 2s	Request latency
Health check	—	3 consecutive fails	Service health

Escalation

If primary does not acknowledge within 5 minutes, secondary is paged. If both miss, the incident escalates to the DevOps lead.