# The Three Pillars
Observability answers the question: "Why is the system behaving this way?" It's not just monitoring (knowing that something is wrong); it's about understanding why.
| Pillar | What | Tools |
|---|---|---|
| Logs | Discrete events with context | ELK, Loki, CloudWatch |
| Metrics | Numeric measurements over time | Prometheus, Datadog, CloudWatch |
| Traces | End-to-end request journey | Jaeger, Zipkin, X-Ray, Tempo |
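The three pillars are most useful when they are linked: a log line that carries the trace_id can be pivoted directly to the matching trace. A minimal structured-logging sketch (the field names here are illustrative, not a standard schema):

```python
import json
import time

def log_event(level: str, message: str, trace_id: str, **fields) -> str:
    """Emit one JSON log line carrying the trace_id, so a line found in
    ELK/Loki can be correlated with the same request's trace in Jaeger."""
    record = {"ts": time.time(), "level": level, "msg": message,
              "trace_id": trace_id, **fields}
    line = json.dumps(record)
    print(line)
    return line

line = log_event("ERROR", "payment declined", trace_id="abc-123-def",
                 service="payment-svc", latency_ms=130)
```

One JSON object per line keeps the logs both grep-able and machine-parseable by any of the log backends in the table above.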
# Distributed Tracing
A single user request might touch 10+ services. Tracing follows that journey, showing exactly where time is spent.
```
Trace: user clicks "Buy Now"        TraceID: abc-123-def

├── [API Gateway]     0ms ─────────────────────── 250ms
│   ├── [Auth Service]     5ms ──── 15ms
│   ├── [Order Service]   20ms ──────────────── 230ms
│   │   ├── [Inventory DB]   25ms ───── 45ms
│   │   ├── [Payment Svc]    50ms ────────── 180ms   ← BOTTLENECK
│   │   │   └── [Stripe API]   55ms ────── 170ms
│   │   └── [Email Queue]   185ms ─ 190ms
│   └── [Response]   235ms ─ 250ms
```

- Each bar = a "span" (service + operation + duration)
- Spans are nested (parent → child)
- The trace_id is propagated via the `traceparent` HTTP header
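The `traceparent` header follows the W3C Trace Context format: `version-trace_id-parent_span_id-flags`. A sketch of how a service might mint and propagate it, using only the stdlib (the helper names are hypothetical, not a real library's API):

```python
import re
import secrets

def make_traceparent(trace_id=None) -> str:
    """Build a W3C traceparent header: version-trace_id-parent_id-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes = 32 hex chars
    span_id = secrets.token_hex(8)                # fresh span for this hop
    return f"00-{trace_id}-{span_id}-01"          # flags 01 = sampled

def parse_traceparent(header: str) -> dict:
    """Extract the IDs a downstream service reuses to continue the trace."""
    m = re.fullmatch(r"(\d{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})",
                     header)
    if not m:
        raise ValueError(f"malformed traceparent: {header}")
    _version, trace_id, span_id, flags = m.groups()
    return {"trace_id": trace_id, "parent_span_id": span_id,
            "sampled": flags == "01"}

# Each hop keeps the trace_id but mints a fresh span_id:
incoming = make_traceparent()
ctx = parse_traceparent(incoming)
outgoing = make_traceparent(ctx["trace_id"])
assert parse_traceparent(outgoing)["trace_id"] == ctx["trace_id"]
```

In practice you would let OpenTelemetry handle this propagation; the point of the sketch is that the trace_id stays constant across services while each span gets its own ID.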
# SLOs, SLAs & Error Budgets
SLOs turn "the system should be reliable" into a measurable target.
SLI (Indicator): "99.2% of requests complete in < 200ms"
↓ (want to achieve)
SLO (Objective): "99.5% of requests must complete in < 200ms"
↓ (promise to customers)
SLA (Agreement): "99.9% uptime, or we refund 10%"
Error Budget = 100% - SLO
SLO: 99.9% → Error budget: 0.1% → 43 min downtime/month
Error budget exhausted?
→ Freeze features, focus on reliability
→ This aligns engineering velocity with reliability goals
# Alerting That Doesn't Suck
Bad alerting: "CPU over 80%" at 3 AM for a non-issue. Good alerting: symptom-based, tied to SLOs, with clear runbooks.
- Alert on symptoms, not causes — Alert: "error rate > 1%" not "CPU > 80%"
- Use burn rate alerts — How fast are we spending our error budget?
- Every alert needs a runbook — What to check, who to escalate to
- Reduce noise — If you ignore an alert repeatedly, fix or delete it
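A burn-rate alert can be sketched as follows. Burn rate is the observed error rate divided by the budget fraction: a rate of 1.0 spends the budget exactly over the SLO window. The 14.4 threshold and the two-window check follow the multiwindow example in the Google SRE Workbook; treat the exact values as tunable assumptions:

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent.
    Burn rate 1.0 exhausts the budget exactly at the end of the window."""
    budget = 1.0 - slo              # e.g. SLO 0.999 → 0.001 budget
    return observed_error_rate / budget

def should_page(rate_1h: float, rate_5m: float, slo: float = 0.999) -> bool:
    """Fast-burn page: both a long and a short window must exceed the
    threshold, so a brief spike that has already recovered pages no one."""
    threshold = 14.4                # ~2% of a 30-day budget spent in 1 hour
    return burn_rate(rate_1h, slo) > threshold and \
           burn_rate(rate_5m, slo) > threshold

print(should_page(rate_1h=0.02, rate_5m=0.02))   # sustained 2% errors → True
print(should_page(rate_1h=0.02, rate_5m=0.0))    # spike already over → False
```

Pairing a slower-burn variant (lower threshold, longer windows) with a ticket instead of a page keeps 3 AM alerts reserved for budget-threatening incidents.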
# ⚡ Key Takeaways
- Three pillars: Logs (what happened), Metrics (how much), Traces (where)
- Use RED (Rate, Errors, Duration) for services and USE (Utilization, Saturation, Errors) for resources
- SLOs + error budgets align feature velocity with reliability goals
- Distributed tracing is essential for debugging microservice latency
- OpenTelemetry is the vendor-neutral standard — learn it once, export anywhere