# The Three Pillars
Observability answers the question: "Why is the system behaving this way?" It's not just monitoring (knowing that something is wrong); it's about understanding why.
| Pillar | What | Tools |
|---|---|---|
| Logs | Discrete events with context | ELK, Loki, CloudWatch |
| Metrics | Numeric measurements over time | Prometheus, Datadog, CloudWatch |
| Traces | End-to-end request journey | Jaeger, Zipkin, X-Ray, Tempo |
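The three pillars are most useful when they are linked: a log line that carries the trace_id can be pivoted directly to the matching trace. A minimal structured-logging sketch (the field names here are illustrative, not a standard schema):

```python
import json
import time

def log_event(level: str, message: str, trace_id: str, **fields) -> str:
    """Emit one JSON log line carrying the trace_id, so a line found in
    ELK/Loki can be correlated with the same request's trace in Jaeger."""
    record = {"ts": time.time(), "level": level, "msg": message,
              "trace_id": trace_id, **fields}
    line = json.dumps(record)
    print(line)
    return line

line = log_event("ERROR", "payment declined", trace_id="abc-123-def",
                 service="payment-svc", latency_ms=130)
```

One JSON object per line keeps the logs both grep-able and machine-parseable by any of the log backends in the table above.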
# Distributed Tracing
A single user request might touch 10+ services. Tracing follows that journey, showing exactly where time is spent.
```
Trace: user clicks "Buy Now"        TraceID: abc-123-def

├── [API Gateway]     0ms ─────────────────────── 250ms
│   ├── [Auth Service]     5ms ──── 15ms
│   ├── [Order Service]   20ms ──────────────── 230ms
│   │   ├── [Inventory DB]   25ms ───── 45ms
│   │   ├── [Payment Svc]    50ms ────────── 180ms   ← BOTTLENECK
│   │   │   └── [Stripe API]   55ms ────── 170ms
│   │   └── [Email Queue]   185ms ─ 190ms
│   └── [Response]   235ms ─ 250ms
```

- Each bar = a "span" (service + operation + duration)
- Spans are nested (parent → child)
- The trace_id is propagated via the `traceparent` HTTP header
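The `traceparent` header follows the W3C Trace Context format: `version-trace_id-parent_span_id-flags`. A sketch of how a service might mint and propagate it, using only the stdlib (the helper names are hypothetical, not a real library's API):

```python
import re
import secrets

def make_traceparent(trace_id=None) -> str:
    """Build a W3C traceparent header: version-trace_id-parent_id-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes = 32 hex chars
    span_id = secrets.token_hex(8)                # fresh span for this hop
    return f"00-{trace_id}-{span_id}-01"          # flags 01 = sampled

def parse_traceparent(header: str) -> dict:
    """Extract the IDs a downstream service reuses to continue the trace."""
    m = re.fullmatch(r"(\d{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})",
                     header)
    if not m:
        raise ValueError(f"malformed traceparent: {header}")
    _version, trace_id, span_id, flags = m.groups()
    return {"trace_id": trace_id, "parent_span_id": span_id,
            "sampled": flags == "01"}

# Each hop keeps the trace_id but mints a fresh span_id:
incoming = make_traceparent()
ctx = parse_traceparent(incoming)
outgoing = make_traceparent(ctx["trace_id"])
assert parse_traceparent(outgoing)["trace_id"] == ctx["trace_id"]
```

In practice you would let OpenTelemetry handle this propagation; the point of the sketch is that the trace_id stays constant across services while each span gets its own ID.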
# SLOs, SLAs & Error Budgets
SLOs turn "the system should be reliable" into a measurable target.
SLI (Indicator): "99.2% of requests complete in < 200ms"
↓ (want to achieve)
SLO (Objective): "99.5% of requests must complete in < 200ms"
↓ (promise to customers)
SLA (Agreement): "99.9% uptime, or we refund 10%"
Error Budget = 100% - SLO
SLO: 99.9% → Error budget: 0.1% → 43 min downtime/month
Error budget exhausted?
→ Freeze features, focus on reliability
→ This aligns engineering velocity with reliability goals
# Alerting That Doesn't Suck
Bad alerting: "CPU over 80%" at 3 AM for a non-issue. Good alerting: symptom-based, tied to SLOs, with clear runbooks.
- Alert on symptoms, not causes — Alert: "error rate > 1%" not "CPU > 80%"
- Use burn rate alerts — How fast are we spending our error budget?
- Every alert needs a runbook — What to check, who to escalate to
- Reduce noise — If you ignore an alert repeatedly, fix or delete it
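A burn-rate alert can be sketched as follows. Burn rate is the observed error rate divided by the budget fraction: a rate of 1.0 spends the budget exactly over the SLO window. The 14.4 threshold and the two-window check follow the multiwindow example in the Google SRE Workbook; treat the exact values as tunable assumptions:

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent.
    Burn rate 1.0 exhausts the budget exactly at the end of the window."""
    budget = 1.0 - slo              # e.g. SLO 0.999 → 0.001 budget
    return observed_error_rate / budget

def should_page(rate_1h: float, rate_5m: float, slo: float = 0.999) -> bool:
    """Fast-burn page: both a long and a short window must exceed the
    threshold, so a brief spike that has already recovered pages no one."""
    threshold = 14.4                # ~2% of a 30-day budget spent in 1 hour
    return burn_rate(rate_1h, slo) > threshold and \
           burn_rate(rate_5m, slo) > threshold

print(should_page(rate_1h=0.02, rate_5m=0.02))   # sustained 2% errors → True
print(should_page(rate_1h=0.02, rate_5m=0.0))    # spike already over → False
```

Pairing a slower-burn variant (lower threshold, longer windows) with a ticket instead of a page keeps 3 AM alerts reserved for budget-threatening incidents.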
# ⚡ Key Takeaways
- Three pillars: Logs (what happened), Metrics (how much), Traces (where)
- Use RED (Rate, Errors, Duration) for services and USE (Utilization, Saturation, Errors) for resources
- SLOs + error budgets align feature velocity with reliability goals
- Distributed tracing is essential for debugging microservice latency
- OpenTelemetry is the vendor-neutral standard — learn it once, export anywhere