The Three Pillars of Observability
The observability community has converged on three signals: metrics, logs, and traces. Each serves a different purpose in incident diagnosis:
Metrics tell you that something is wrong. CPU at 95%, error rate at 5%, p99 latency at 10 seconds. These are your alerting signals.
Logs help you understand what happened. Request-level context, error messages, stack traces. These answer "what exactly failed?"
Distributed traces show you where time was spent in a request that touched multiple services. Essential for microservice architectures.
Getting all three right is hard. Here's what our setup looked like at TouchNote.
Metrics with Grafana + Prometheus
Prometheus scraped metrics from Kubernetes pods (via annotations), node_exporter for host metrics, and custom application metrics we exposed via Prometheus client libraries. Grafana visualised them.
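Annotation-driven scraping looked roughly like this on each pod. These annotation keys are the de-facto convention consumed by the usual `kubernetes_sd_configs` relabelling rules, not something mandated by Kubernetes, and the port/path values here are illustrative:

```yaml
# Illustrative pod metadata opting the pod into Prometheus scraping.
metadata:
  annotations:
    prometheus.io/scrape: "true"   # relabelling rules keep only annotated pods
    prometheus.io/port: "9090"     # port the app's metrics endpoint listens on
    prometheus.io/path: "/metrics" # default path for Prometheus client libraries
```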
Key dashboards we built:
- Cluster overview: node CPU/memory/disk, pod restart counts, deployment status
- Application performance: request rate, error rate, latency percentiles (p50, p95, p99) — the RED method
- Business metrics: orders per minute, payment success rate, fulfilment queue depth
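The RED panels boil down to three simple quantities computed over request samples. Here is a dependency-free sketch of what each panel computes — the class and method names are illustrative, not TouchNote's actual code, which used the Prometheus client libraries mentioned above:

```python
class RedMetrics:
    """Toy accumulator for the three RED signals: Rate, Errors, Duration."""

    def __init__(self):
        self.samples = []  # list of (duration_seconds, is_error)

    def observe(self, duration, is_error=False):
        """Record one request's duration and whether it failed."""
        self.samples.append((duration, is_error))

    def rate(self, window_seconds):
        # Rate: requests per second over the observation window.
        return len(self.samples) / window_seconds

    def error_rate(self):
        # Errors: fraction of requests that failed.
        return sum(1 for _, err in self.samples if err) / len(self.samples)

    def percentile(self, p):
        # Duration: nearest-rank percentile of durations (p50, p95, p99).
        durations = sorted(d for d, _ in self.samples)
        idx = min(len(durations) - 1, int(len(durations) * p / 100))
        return durations[idx]
```

In Prometheus terms, `rate()` and `error_rate()` correspond to `rate()` over a counter, and `percentile()` to a quantile over a histogram.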
The business metric dashboards were the most valuable during incidents. A drop in orders per minute is often detectable before the underlying technical metric alerts fire — because traffic can be flowing fine while a downstream dependency is silently failing.
Application Performance Monitoring with NewRelic
NewRelic's APM agent gave us:
- Transaction traces: full request traces with time spent per function/database query
- Error analytics: grouped by error type, with stack traces and frequency
- Apdex score: a standardised satisfaction score that combines response time and error rate into a single signal, useful for stakeholder communication
NewRelic's query language (NRQL) let us answer ad-hoc questions quickly: "What percentage of checkout requests in the last hour had a response time over 2 seconds?" These ad-hoc queries were invaluable for post-incident investigation.
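That checkout question maps to a short NRQL query. This is a sketch, not a query from TouchNote's account: `percentage()` and the `Transaction` event with its `duration` attribute (in seconds) follow New Relic's standard APM event schema, and the `name LIKE '%checkout%'` filter is an assumed naming convention:

```sql
-- Share of checkout transactions slower than 2s in the last hour
SELECT percentage(count(*), WHERE duration > 2)
FROM Transaction
WHERE name LIKE '%checkout%'
SINCE 1 hour ago
```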
Alerting Philosophy
The worst outcome is alert fatigue — when the number of firing alerts is so high that engineers stop trusting the system. We applied strict rules:
1. Every alert must be actionable. If you don't know what to do when an alert fires, the alert shouldn't exist.
2. Alert on symptoms, not causes. Alert on high error rate (symptom), not high database CPU (cause). The cause might vary; the symptom is what affects users.
3. P1 alerts wake people up. Everything else can wait for business hours. Be ruthless about what actually warrants a 3am call.
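A symptom-based rule in Prometheus form might look like the sketch below. The metric name, 5% threshold, and label values are illustrative assumptions, not the production rules:

```yaml
# Hedged sketch: page when the user-visible error rate stays above 5%
# for 5 minutes, rather than alerting on an internal cause like DB CPU.
groups:
  - name: symptoms
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page  # P1: wakes someone up
```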
We maintained a monthly alert review — any alert that fired more than 10 times without resulting in a meaningful action was either tuned or deleted.
The Human Element
None of these tools matter without runbooks. Every alert should link to a runbook: what this alert means, what to check first, common causes, remediation steps. Writing runbooks forces you to think clearly about your systems. And when an incident happens at 2am, the ability to follow a runbook rather than improvise is invaluable.