Oncall isn't supposed to be this hard

Well-configured alerts connect the symptom to traces, logs, deploys, and the suspect commit.

Kirti Rathore · 6 min read

Bad Prometheus alerts tell an oncall engineer something is wrong, while good alerts connect the symptom to traces, logs, deploys, and the suspect commit.

That distinction sounds small until you're oncall and an alert storm hits.

You open one of the alerts and see:

[CRITICAL] CheckoutHighErrorRate - 7.3% 5xx in prod-eu-west-1

The alert is not wrong. Checkout is returning 5xx errors. But it doesn't tell you which host, container, or VM to investigate first.

let the wild hunt begin

The SRE or developer now has to do all the work.

If you know what you're doing, you first check the alert definition.

A basic Prometheus alerting rule usually looks like this:

- alert: CheckoutHighErrorRate
  expr: |
    sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{service="checkout"}[5m])) > 0.05
  for: 10m
  labels:
    severity: critical
    team: payments
    service: checkout
  annotations:
    summary: "Checkout 5xx rate is above 5%"
    runbook_url: "https://runbooks.corp/payments/checkout-5xx"
    dashboard_url: "https://grafana.corp/d/checkout"

This gives you some important pieces of information:

  • The alert computes the fraction of checkout requests that returned 5xx over a 5-minute window and fires when that fraction stays above 5% for 10 minutes.
  • The alert is owned by the Payments team.
  • There is a runbook you can start from.
Prometheus expression browser showing the raw query behind the checkout high error rate alert.
The raw PromQL graph shows why the alert fired but doesn't give much more context.
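To make the rule's arithmetic concrete, here is a minimal Python sketch of what the expression evaluates. The counter samples are made up for illustration (chosen to land on the 7.3% from the alert above), the function names are hypothetical, and the rate calculation is a simplification: the real PromQL rate() also handles counter resets and extrapolation.

```python
def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Simplified stand-in for PromQL rate(): counter increase / window length."""
    return (last - first) / window_seconds


def error_ratio(errors_rate: float, total_rate: float) -> float:
    """Share of requests that returned 5xx.

    Returns 0.0 when there is no traffic. PromQL would yield NaN for 0/0,
    which also fails the "> 0.05" comparison, so the outcome is the same.
    """
    return errors_rate / total_rate if total_rate > 0 else 0.0


WINDOW = 5 * 60    # the [5m] range in the rule, in seconds
THRESHOLD = 0.05   # the "> 0.05" comparison in the rule

# Hypothetical counter samples at the start and end of the window.
total = per_second_rate(first=120_000, last=150_000, window_seconds=WINDOW)
errors = per_second_rate(first=900, last=3_090, window_seconds=WINDOW)

ratio = error_ratio(errors, total)
condition_holds = ratio > THRESHOLD
print(f"error ratio = {ratio:.3f}, condition holds = {condition_holds}")
```

Note that a single evaluation above the threshold is not enough to page anyone: with `for: 10m`, the condition has to hold on every evaluation for 10 minutes before the alert fires.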

But there is no reason to celebrate just yet.

The real work begins now.

In every tool you open, narrow the time window to within 5 minutes of the alert time. Then:

  • Open Grafana and check if the dashboards have any extra information.
  • Open Loki and write a query like {service_name="checkout"} |= "500".
  • Open Tempo and filter traces by time. Guess which trace represents the incident.
  • Open your CD pipeline and search for any deploys just before the alert.

At some point, several possible hypotheses emerge:

  • A big new feature in checkout-api@v2.4.1 looks suspicious.
  • CPU usage is high on 3 of the 5 hosts that reported 5xx errors.
  • There are suspicious I/O errors on all the investigated hosts.
  • DB transactions are slow.

Eventually the developer manages to reconstruct context across four tools, in about an hour if they know exactly what they're doing.

Meanwhile, there may be other fresh alerts to investigate.

good alerts tell you where to start looking

The same stack can behave very differently.

Not a different vendor. Not a more expensive alerting product.

The same stack, wired correctly to bubble up context.

Here is what it would look like for the Prometheus/Grafana/Tempo/Loki stack:

  • Metrics exported to Prometheus through the OpenTelemetry SDK.
  • Histogram samples correlated with trace spans.
  • Exemplars enabled and rendered in Grafana.
  • Tempo set up with trace-to-logs enabled.
  • A deploy marker, service.version, and commit SHA attached as metadata to each alert.

The alert still starts with a metric. It should. Metrics are how you detect the symptom.

But the metric now carries a breadcrumb to a specific request.

Grafana latency histogram with exemplar diamonds linking the high p95 bucket to a Tempo trace.
Exemplars are the bridge from an aggregate bucket to a specific slow request.

Prometheus alerts do not naturally carry a trace_id. A histogram bucket is an aggregate. It is not a request.

Exemplars change that. A sampled measurement can attach the active trace_id to the bucket. Grafana can render that as a clickable diamond. Click it and Tempo opens the representative trace.
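On the wire, an exemplar is just extra data appended to a sample. In the OpenMetrics text format it follows a `#` after the sample value. The sketch below renders one such line by hand to show the shape. The metric name, bucket boundary, and count are hypothetical, the trace_id is the one used in the log example in this post, and in practice your metrics client library would emit this line for you.

```python
def exemplar_line(metric: str, labels: dict, value: float,
                  exemplar_labels: dict, exemplar_value: float) -> str:
    """Render one sample with an exemplar in OpenMetrics text style:

        name{labels} value # {exemplar_labels} exemplar_value
    """
    def fmt(pairs: dict) -> str:
        return "{" + ",".join(f'{k}="{v}"' for k, v in pairs.items()) + "}"

    return f"{metric}{fmt(labels)} {value} # {fmt(exemplar_labels)} {exemplar_value}"


line = exemplar_line(
    metric="http_server_duration_seconds_bucket",   # hypothetical metric name
    labels={"service": "checkout", "le": "2.5"},    # hypothetical bucket
    value=17,                                       # hypothetical cumulative count
    exemplar_labels={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    exemplar_value=1.21,                            # the sampled request took 1.21 s
)
print(line)
```

The bucket count is still an aggregate. The part after `#` is the breadcrumb: one real request that fell into this bucket, identified by its trace_id.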

Grafana Tempo trace waterfall showing the slow database span, span attributes, logs tab, and service version.
The trace shows the slow span and the context attached to it: database statement, feature flag, user, and service version.

In the good version, the selected span says:

service: db-primary
operation: SELECT orders WHERE user_id=$1
duration: 1210ms
db.rows_affected: 1110482
feature_flag.new_checkout: true
service.version: 2.4.1

The slow database query is now visible in the distributed trace.

Then Tempo's trace-to-logs link opens Loki for the exact trace.

Grafana Explore with Loki logs filtered by trace_id for the checkout trace.
Trace-to-logs only works if logs carry the same trace identifier.
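One way to make logs carry the trace identifier is to stamp it onto every log record before formatting. The sketch below does this with the Python standard library only. The `current_trace` context variable is a hypothetical stand-in for the active span that a tracing SDK would normally track, and the JSON field names are an assumption, not a required schema.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace context (normally owned by the tracing SDK).
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace_id and span_id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = current_trace.get() or {}
        record.trace_id = ctx.get("trace_id", "")
        record.span_id = ctx.get("span_id", "")
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including the trace identifiers."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Pretend a request with this trace context is being handled.
current_trace.set({
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
})
logger.warning("slow query: seq scan on orders (1.1M rows), index not used")

entry = json.loads(stream.getvalue())
print(stream.getvalue().strip())
```

With the identifier present as a field on every line, a log query can filter on the exact trace instead of on a time window and a string match.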

The log line is not buried in a time-window query anymore:

slow query: seq scan on orders (1.1M rows), index not used
trace_id=4bf92f3577b34da6a3ce929d0e0e4736
span_id=00f067aa0ba902b7
service.version=2.4.1
commit=7a3f9c2

Now the hypothesis is no longer vague.

  • checkout-api@v2.4.1 added the new order-history query path.
  • The user_id column on orders needs an index.
  • The bad path is gated by feature_flag.new_checkout=true.
  • Disable the flag or roll back 7a3f9c2.

the configuration is what makes the oncall experience fun

None of this comes automatically, whether you are using Prometheus + Grafana, Datadog, or New Relic.

The good path needs deliberate plumbing:

  • Page on symptoms: error rate, latency, traffic, saturation, or SLO burn.
  • Put team, service, severity, runbook_url, and a scoped dashboard_url on the alert.
  • Propagate W3C trace context through every service.
  • Inject trace_id and span_id into structured logs.
  • Enable exemplars on the histogram used by the alert.
  • Configure Grafana so exemplars open Tempo.
  • Configure Tempo trace-to-logs so spans open Loki.
  • Emit service.version, deploy annotations, and commit SHA from CI/CD.
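The trace-context item in the list above is what keeps one trace_id alive across services. As a rough illustration, W3C trace context travels in a `traceparent` HTTP header made of four lowercase hex fields: version, trace ID, parent span ID, and flags. The sketch below parses an incoming header and builds the one to send downstream. The function names are hypothetical, it only handles the version 00 layout, and a real service would normally rely on its tracing SDK's propagator rather than hand-rolled code.

```python
import re

# traceparent: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting invalid values."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit = sampled
    }


def child_traceparent(incoming: dict, new_span_id: str) -> str:
    """Header to send downstream: same trace_id, this service's span as parent."""
    flags = "01" if incoming["sampled"] else "00"
    return f"00-{incoming['trace_id']}-{new_span_id}-{flags}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
# Hypothetical span ID generated by this service for its own work.
outgoing = child_traceparent(incoming, new_span_id="b7ad6b7169203331")
print(outgoing)
```

The trace_id never changes along the call chain, which is why the same value can later show up in the exemplar, the trace, and the log line.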

You can play around with such a well-configured setup here.

AI SRE

The useful AI SRE workflow starts after the observability stack has preserved the evidence trail. The agent can help with root cause analysis, propose the fix, and validate the patch. But if the alert drops the trace, the log correlation, and the deploy context, the agent has the same problem the human does: it is guessing.

For where we are taking this in the product, see FixBugs and AI SRE tools.
