GolangStepByStep

Observability Advanced

Metrics, tracing, OpenTelemetry concepts, production monitoring

# What is Observability? (The Dashboard Analogy)

Imagine driving a cheap go-kart. When something breaks, white smoke pours out. You have to park it, take the engine apart piece by piece, and figure out what failed using a flashlight. This is the equivalent of running a web server with no observability.

Observability is like driving a multi-million-dollar Formula 1 car. The dashboard tells you your exact speed (Metrics). The computer logs every gear shift the driver makes (Logs). And if the car slows down by 0.5 seconds on turn 3, sensors trace the exact pressure loss from the pedal, through the hydraulics, to the brakes (Tracing).

If your Go app faces the internet, you are effectively blind without these three pillars.

# Level 1: The Three Pillars (Beginner)

To reach expert production level, you must understand when to use which tool. Never use them for the wrong job.

  • 🪵 Logs (The What): Individual records of discrete events ("User 42 failed to log in at 10:04 PM"). They are the most detailed signal, but also the heaviest and most expensive to store.
  • 📊 Metrics (The How Much): Numerical data aggregated over time ("Login failed 500 times in the last 10 minutes"). They are cheap to store because their size depends on the number of distinct label combinations, not on the number of requests, which makes them ideal for triggering alerts (paging a developer at 3 AM).
  • 🕸️ Traces (The Where): The life of a single request as it crosses multiple services ("The login request took 2.5 seconds: 0.1s in the Go API, 2.3s stuck in the Authentication Microservice, 0.1s in the Database").
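To make the Logs bullet concrete, here is a minimal sketch of a single structured log record, using the standard library's log/slog package (Go 1.21+). The function name renderFailedLogin and the "user_id" field are hypothetical, picked only for this illustration. The timestamp is dropped purely so the output is reproducible; real logs should keep it.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// renderFailedLogin writes one structured log record for a failed login
// and returns the raw JSON line. Name and fields are illustrative only.
func renderFailedLogin(userID int) string {
	var buf bytes.Buffer
	opts := &slog.HandlerOptions{
		// Drop the top-level timestamp so the demo output is stable.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{}
			}
			return a
		},
	}
	logger := slog.New(slog.NewJSONHandler(&buf, opts))
	logger.Warn("login failed", "user_id", userID)
	return buf.String()
}

func main() {
	// Prints: {"level":"WARN","msg":"login failed","user_id":42}
	fmt.Print(renderFailedLogin(42))
}
```

Notice that every failed login produces a whole new line like this one. That is exactly why logs are detailed but expensive, while a metric for the same event would just bump one number.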

# Level 2: Adding Metrics (Intermediate)

Let's add Prometheus Metrics to our Go server. We want to know exactly how many times people visit our website. To do this, we use a Counter.

A Counter is simply a number held in your process's memory that only ever goes up: we increment it (add 1) every time our endpoint is hit, and it goes back to zero only when the process restarts.

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// 1. Define your Metric globally
var (
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests made.",
        },
        []string{"path"}, // We can segment this by the URL path!
    )
)

func init() {
    // 2. Register it with the default registry so the /metrics handler exposes it
    prometheus.MustRegister(requestsTotal)
}

func main() {
    // 3. Our actual web handler
    http.HandleFunc("/buy", func(w http.ResponseWriter, r *http.Request) {
        // INCREMENT THE METRIC IN MEMORY! This is a cheap in-memory operation.
        requestsTotal.WithLabelValues("/buy").Inc()
        w.Write([]byte("Item Bought"))
    })

    // 4. Open the /metrics endpoint so Prometheus (or you) can scrape the data!
    http.Handle("/metrics", promhttp.Handler())

    // ListenAndServe only returns on failure (e.g. the port is already in use).
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Try it yourself: Run this code, hit /buy three times, and then visit http://localhost:8080/metrics in your browser. Among the default Go runtime metrics, you'll literally see this raw text line: http_requests_total{path="/buy"} 3
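If you want to see why a labelled counter is so cheap, here is a toy model built only from the standard library. It is not how client_golang is implemented, and the names counterVec, Render, and simulate are made up for this sketch. It only shows the idea: one number per label value, rendered as text on demand.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// counterVec is a toy labelled counter: one number per "path" label value.
type counterVec struct {
	mu     sync.Mutex
	name   string
	values map[string]float64
}

func newCounterVec(name string) *counterVec {
	return &counterVec{name: name, values: make(map[string]float64)}
}

// Inc adds 1 to the series for the given path label.
func (c *counterVec) Inc(path string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[path]++
}

// Render returns one text line per series, sorted by label for stable output.
func (c *counterVec) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	paths := make([]string, 0, len(c.values))
	for p := range c.values {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	var b strings.Builder
	for _, p := range paths {
		fmt.Fprintf(&b, "%s{path=%q} %g\n", c.name, p, c.values[p])
	}
	return b.String()
}

// simulate records one hit per given path and returns the rendered text.
func simulate(paths ...string) string {
	c := newCounterVec("http_requests_total")
	for _, p := range paths {
		c.Inc(p)
	}
	return c.Render()
}

func main() {
	fmt.Print(simulate("/buy", "/buy", "/buy", "/cart"))
}
```

Four requests produced only two lines of output. A million requests to the same two paths would still produce two lines, which is the whole trick behind metrics being cheap.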

# Level 3: Distributed Tracing & Spans (Advanced)

Metrics tell you that your API suddenly takes 5 seconds. But why? Did the database slow down? Is the third-party Stripe API timing out? This is where Traces shine.

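In tracing, the whole request is the Trace, and each unit of work you choose to instrument inside it (a handler, a database query, an outbound HTTP call) is a Span. OpenTelemetry (OTel) is the vendor-neutral industry standard for this, and it ships an official Go SDK. OTel carries the current Span inside Go's context.Context!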

import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
)

func CheckoutHandler(w http.ResponseWriter, r *http.Request) {
    tracer := otel.Tracer("checkout-api")

    // 1. Start a Main Span using the HTTP Request Context
    ctx, span := tracer.Start(r.Context(), "ProcessCheckout")
    defer span.End() // This records exactly how long the whole checkout took.

    // 2. Pass THAT context to the payment function!
    ChargeCreditCard(ctx, 100)
    
    w.Write([]byte("Done"))
}

func ChargeCreditCard(ctx context.Context, amount int) {
    tracer := otel.Tracer("checkout-api")
    
    // 3. Create a CHILD Span! 
    // Because we passed the context, OTel knows this operation belongs to 'ProcessCheckout'
    _, span := tracer.Start(ctx, "StripeAPICall")
    defer span.End()

    // ... do external HTTP call ...
}

When exported to a tool like Jaeger or Datadog, this generates a beautiful visual waterfall graph showing:
[---- ProcessCheckout 1.2s ----]
        [-- StripeAPICall 1.0s --]
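How does the child span know its parent? The sketch below is a toy model of the mechanism, written with only the standard library. It is not OTel's actual implementation, and toySpan, startSpan, and lineage are hypothetical names. The point is just that the "current span" travels inside context.Context, so whoever receives the context can link to it.

```go
package main

import (
	"context"
	"fmt"
)

// toySpan is a simplified span that only remembers its name and parent.
type toySpan struct {
	name   string
	parent *toySpan
}

// spanKey is the private context key under which the current span is stored.
type spanKey struct{}

// startSpan creates a span whose parent is whatever span ctx already carries,
// and returns a new context carrying the new span.
func startSpan(ctx context.Context, name string) (context.Context, *toySpan) {
	parent, _ := ctx.Value(spanKey{}).(*toySpan) // nil if ctx has no span
	s := &toySpan{name: name, parent: parent}
	return context.WithValue(ctx, spanKey{}, s), s
}

// lineage renders the chain from the root span down to s.
func lineage(s *toySpan) string {
	if s.parent == nil {
		return s.name
	}
	return lineage(s.parent) + " > " + s.name
}

func chargeCreditCard(ctx context.Context) string {
	_, s := startSpan(ctx, "StripeAPICall")
	return lineage(s)
}

// checkout passes its context down, so the child is linked to the parent.
func checkout() string {
	ctx, _ := startSpan(context.Background(), "ProcessCheckout")
	return chargeCreditCard(ctx)
}

// checkoutForgettingContext shows the classic bug: a fresh context is passed,
// so the child span is orphaned.
func checkoutForgettingContext() string {
	_, _ = startSpan(context.Background(), "ProcessCheckout")
	return chargeCreditCard(context.Background())
}

func main() {
	fmt.Println(checkout())                  // ProcessCheckout > StripeAPICall
	fmt.Println(checkoutForgettingContext()) // StripeAPICall
}
```

The second line of output is the thing to remember: if you forget to pass ctx down, the waterfall falls apart into unrelated spans.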

# Level 4: The OpenTelemetry Revolution (Expert)

Before OpenTelemetry, Observability in Go was a nightmare.

If your company used Datadog, you had to import Datadog's own tracing library (dd-trace-go) all over your code. If your manager decided Datadog was too expensive and wanted to switch to Honeycomb, you had to literally delete thousands of lines of Datadog-specific code in your repository and rewrite them against the Honeycomb library.

OpenTelemetry (OTel) fixed this. It is an open-source, vendor-neutral standard. Your application code imports only "go.opentelemetry.io/otel", and the choice of backend is confined to startup configuration.

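import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// 1. Write vendor-neutral code using only the OTel API
func GenerateReport(ctx context.Context) error {
    // Look mom, no vendor-specific code!
    counter, err := otel.Meter("reports").Int64Counter("reports_generated")
    if err != nil {
        return err
    }
    counter.Add(ctx, 1)
    return nil
}

// 2. Set up your Exporter once at application boot
func InitializeOTel(ctx context.Context, backend string) (*sdktrace.TracerProvider, error) {
    var exporter sdktrace.SpanExporter
    var err error

    // Many backends (Jaeger, the Datadog Agent, ...) ingest OTLP directly, so
    // switching is often just a new OTEL_EXPORTER_OTLP_ENDPOINT value.
    switch backend {
    case "otlp":
        exporter, err = otlptracehttp.New(ctx)
    default:
        exporter, err = stdouttrace.New() // print spans locally
    }
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
    otel.SetTracerProvider(tp)
    return tp, nil // call tp.Shutdown(ctx) on exit to flush buffered spans
}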
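If you'd like to play with the idea without installing any SDK, the design boils down to "application code depends on an interface, boot code picks the implementation". Here is a stdlib-only toy version. Everything in it (spanExporter, stdoutExporter, otlpExporter, newExporter) is a hypothetical stand-in, not an OTel type, and the endpoint string is just a placeholder.

```go
package main

import "fmt"

// spanExporter is a simplified, made-up stand-in for an exporter interface.
type spanExporter interface {
	Export(spanName string) string
}

// stdoutExporter pretends to print spans locally.
type stdoutExporter struct{}

func (stdoutExporter) Export(spanName string) string {
	return "stdout: " + spanName
}

// otlpExporter pretends to ship spans to a collector endpoint.
type otlpExporter struct{ endpoint string }

func (e otlpExporter) Export(spanName string) string {
	return "otlp(" + e.endpoint + "): " + spanName
}

// newExporter is the ONLY place that knows which backend is in use.
func newExporter(backend string) spanExporter {
	if backend == "otlp" {
		return otlpExporter{endpoint: "localhost:4318"} // placeholder endpoint
	}
	return stdoutExporter{}
}

// generateReport is application code: it never mentions a vendor.
func generateReport(exp spanExporter) string {
	return exp.Export("GenerateReport")
}

func main() {
	fmt.Println(generateReport(newExporter("stdout")))
	fmt.Println(generateReport(newExporter("otlp")))
}
```

Swapping the backend changes one argument at boot, and generateReport is untouched. That is the same property the real OTel setup gives you, just at toy scale.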

As an expert Go developer, you should reach for OpenTelemetry first for metrics and tracing in new projects. It keeps your instrumentation vendor-neutral at the code level.
