Target: clear, short, tool‑agnostic. Stack examples: Dynatrace, Grafana/Prometheus/Loki, Splunk, OpenTelemetry.
Purpose
Know what’s broken, why, and how to fix it before users notice. Tie signals to business SLOs (availability, latency) and make teams accountable.
What to watch (the essentials)
Golden signals for every service (see the instrumentation sketch below):
- Latency: p50/p95 request time (and DB time).
- Traffic: RPS/QPS, concurrency, queue depth.
- Errors: 5xx, timeouts, failed jobs.
- Saturation: CPU, memory, disk IO, network, pod restarts.
Plus KOB health:
- Control plane: API server, etcd, scheduler, controller manager.
- Node health: pressure (CPU/mem/disk), kubelet, CNI.
- Deploy health: readiness/liveness failures, crash loops.
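A minimal sketch of the app-side half of these signals, using Python with prometheus_client; the metric names, labels, port 9000, and the simulated handler are illustrative assumptions, and saturation (CPU/mem/disk/network) comes from node exporter/cAdvisor rather than application code:
```python
# Minimal sketch (assumed metric names, labels, port, and simulated handler).
# Latency, traffic, errors, and concurrency are exposed by the app itself;
# saturation comes from node exporter / cAdvisor, not from this code.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency (p50/p95 via histogram_quantile over these buckets)",
    ["service", "route"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
REQUESTS_TOTAL = Counter(
    "http_requests_total", "Traffic and error rate (split by status code)",
    ["service", "route", "status"],
)
IN_FLIGHT = Gauge(
    "http_requests_in_flight", "Concurrency / queue-depth proxy", ["service"]
)

def handle_request(service: str = "checkout", route: str = "/pay") -> None:
    IN_FLIGHT.labels(service).inc()
    start = time.perf_counter()
    status = "200"
    try:
        time.sleep(random.uniform(0.01, 0.2))   # simulated work
        if random.random() < 0.02:
            status = "500"                      # simulated failure
    finally:
        REQUEST_LATENCY.labels(service, route).observe(time.perf_counter() - start)
        REQUESTS_TOTAL.labels(service, route, status).inc()
        IN_FLIGHT.labels(service).dec()

if __name__ == "__main__":
    start_http_server(9000)   # Prometheus scrapes :9000/metrics
    while True:
        handle_request()
```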
Tool roles (minimal mental model)
- Dynatrace: deep APM (code‑level traces), RUM, Davis AI for anomaly detection, Service maps.
- Grafana + Prometheus: real‑time metrics, SLO burn‑rate, capacity dashboards, Alertmanager.
- Loki (or ELK): app and platform logs, quick correlation with labels (namespace, pod, container).
- Splunk: log archive, compliance, security analytics/SIEM, long‑term investigations.
- OpenTelemetry: one collector for metrics/logs/traces → route to Dynatrace/Grafana/Splunk without agent sprawl.
Reference pattern (simple)
- Instrument apps with the OpenTelemetry SDK or auto‑instrumentation (see the sketch after this list).
- Cluster exporters: node exporter (DaemonSet), kube‑state‑metrics, cAdvisor (built into the kubelet).
- Prometheus scrapes metrics → Grafana dashboards & Alertmanager.
- Traces → Dynatrace (APM); sample to Tempo if needed.
- Logs → Loki for ops + Splunk for retention/SIEM.
- SLOs in Grafana (or Dynatrace SLOs) with burn‑rate alerts.
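A minimal sketch of the first step with the OpenTelemetry Python SDK, sending traces and metrics over OTLP to one collector that then routes to Dynatrace/Grafana/Splunk; the service name, collector endpoint, and span/metric names are assumptions:
```python
# Minimal sketch (assumed service name, collector endpoint, span/metric names).
# Everything goes over OTLP to a single collector; routing to Dynatrace, Grafana,
# or Splunk is then a collector-config concern, not an application concern.
from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

OTLP_ENDPOINT = "http://otel-collector.observability:4317"   # assumed in-cluster address
resource = Resource.create({"service.name": "checkout", "deployment.environment": "prod"})

# Traces: batched spans to the collector.
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=OTLP_ENDPOINT, insecure=True))
)
trace.set_tracer_provider(tracer_provider)

# Metrics: periodic push to the same collector.
meter_provider = MeterProvider(
    resource=resource,
    metric_readers=[
        PeriodicExportingMetricReader(OTLPMetricExporter(endpoint=OTLP_ENDPOINT, insecure=True))
    ],
)
metrics.set_meter_provider(meter_provider)

tracer = trace.get_tracer("checkout")
meter = metrics.get_meter("checkout")
orders = meter.create_counter("orders_total", description="Completed orders")

def place_order() -> None:
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.channel", "web")
        orders.add(1, {"status": "ok"})

if __name__ == "__main__":
    place_order()
    tracer_provider.shutdown()   # flush spans before exit
    meter_provider.shutdown()    # flush metrics before exit
```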
Alerts that matter (keep it tight)
- SLO burn‑rate: 1h/6h windows for availability & latency (see the sketch below).
- Error rate: >X% for Y minutes by service.
- CrashLoopBackOff spike or restart storm.
- Pod unschedulable (node pressure/taints) and PDB violations.
- etcd/API server latency and leader health.
- Storage: PVC pending, volume errors, high fs usage.
Noise‑killers: alert on symptoms, not every container event; route detail to runbooks.
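To make the burn‑rate math concrete, here is a small Python sketch assuming a 99.9% availability SLO over 30 days and the common fast/slow‑burn thresholds; in practice these checks live in Prometheus/Grafana alert rules over request counters, not in application code:
```python
# Minimal sketch (assumptions: 99.9% SLO over a 30-day window, common SRE-style
# thresholds). This only shows the arithmetic behind burn-rate alerting.
from dataclasses import dataclass

SLO_TARGET = 0.999                 # 99.9% availability
ERROR_BUDGET = 1 - SLO_TARGET      # 0.1% of requests may fail over the 30-day window

@dataclass
class Window:
    name: str
    error_rate: float   # failed / total requests observed in this window
    threshold: float    # maximum acceptable burn rate before alerting

    @property
    def burn_rate(self) -> float:
        # Burn rate 1.0 means spending the budget exactly at the sustainable pace.
        # A 1h burn rate of 14.4 consumes ~2% of a 30-day budget in a single hour
        # (14.4 / 720 hours ≈ 2%); a 6h burn rate of 6 consumes ~5% in six hours.
        return self.error_rate / ERROR_BUDGET

    @property
    def firing(self) -> bool:
        return self.burn_rate > self.threshold

def evaluate(windows: list[Window]) -> None:
    for w in windows:
        state = "FIRING" if w.firing else "ok"
        print(f"{w.name}: burn={w.burn_rate:.1f} (threshold {w.threshold}) -> {state}")

if __name__ == "__main__":
    evaluate([
        Window("1h fast burn (page)",   error_rate=0.02,  threshold=14.4),
        Window("6h slow burn (ticket)", error_rate=0.008, threshold=6.0),
    ])
```
The same idea applies to latency SLOs: replace the error rate with the fraction of requests slower than the latency objective.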
Dashboards to publish (one page each)
- Exec SLO view: uptime, latency percentiles, top failing services, user impact.
- Service owner: golden signals, dependency map, recent deploys vs errors.
- Platform: node capacity, control‑plane health, CNI/CSI status, queue backlogs.
- Cost/capacity: CPU/mem requests vs usage, idle %, hotspot namespaces (see the sketch below).
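A worked example of the idle/hotspot arithmetic; the numbers and the 85% hotspot cutoff are illustrative assumptions, and in practice requests come from kube‑state‑metrics and usage from cAdvisor/metrics‑server:
```python
# Minimal sketch (illustrative numbers; assumed 85% usage/requests hotspot cutoff).
# Idle % = capacity that is requested (and paid for) but not actually used.
requests_by_namespace = {"payments": 32.0, "search": 48.0, "batch": 16.0}  # CPU cores requested
usage_by_namespace    = {"payments": 9.5,  "search": 41.0, "batch": 2.0}   # CPU cores used (avg)

for ns, requested in sorted(requests_by_namespace.items()):
    used = usage_by_namespace.get(ns, 0.0)
    idle_pct = 100 * (1 - used / requested) if requested else 0.0
    hotspot = "HOTSPOT" if requested and used / requested > 0.85 else ""
    print(f"{ns:10s} requested={requested:5.1f} used={used:5.1f} idle={idle_pct:5.1f}% {hotspot}")
```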
Operating model
- Ownership: platform team owns cluster SLO; app teams own service SLOs and on‑call.
- Everything as code: dashboards, alerts, SLOs in Git. PR = change.
- Runbooks: for top incidents (deploy rollback, node drain, PVC restore, degraded API server).
- Drills: quarterly chaos/game days; measure MTTR.
Rollout in 5 moves
- Define 2–3 business SLOs; map services to them.
- Ship OpenTelemetry and baseline scrapes/logs.
- Stand up Grafana/Prometheus/Loki; wire Dynatrace traces; archive to Splunk.
- Create the four dashboards; add burn‑rate alerts.
- Do one game day; fix gaps; repeat monthly.
Bottom line
If you can see golden signals and traces for every service, you can see control‑plane health, and you have SLO burn‑rate alerts backed by clean runbooks, you’re observability‑ready for KOB.