Building an Observability Stack from Scratch

Observability is not monitoring with a fancier name. Monitoring tells you when something is broken. Observability tells you why. After building observability platforms for production environments spanning 50+ microservices, I've learned that the difference between a team that resolves incidents in 5 minutes and one that takes 2 hours often comes down to how well their observability stack is designed.

The Three Pillars

A production-grade observability stack rests on three pillars: metrics, logs, and traces. Each serves a different purpose, and you need all three to diagnose issues effectively.

Metrics tell you what is happening at a high level - request rates, error rates, latency percentiles, CPU usage, memory consumption. They're cheap to collect, fast to query, and perfect for dashboards and alerts.
Logs tell you the details - the specific error message, the request payload that caused a failure, the stack trace. They're expensive to store at scale but essential for root cause analysis.
Traces show you the journey of a single request across multiple services. When a checkout request is slow, traces tell you whether the bottleneck is in the payment service, the inventory service, or the database.

Metrics: Prometheus + Grafana

Prometheus is the industry standard for metrics collection in cloud-native environments. I deploy it using the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, Alertmanager, and a curated set of recording rules and dashboards.

Key architectural decisions:

Federation for scale: Each Kubernetes cluster runs its own Prometheus instance. A central Thanos or Cortex deployment sits on top of them as a global query and long-term storage layer, so dashboards can span clusters without any single Prometheus having to hold everything.
Recording rules for performance: Instead of running expensive PromQL queries at dashboard load time, I pre-compute common aggregations (e.g., 5-minute error rates) as recording rules. Dashboards then read a pre-computed series rather than re-evaluating the raw query on every load, which keeps them fast even over long time ranges.
Alert routing: Alertmanager routes critical alerts to PagerDuty, warnings to Slack, and informational alerts to email. I use inhibition rules to prevent alert storms - if an entire node is down, I don't need 50 separate pod alerts.
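
To make the routing and inhibition idea concrete, here is a minimal Python model of it. This is not Alertmanager's configuration format or its implementation; the alert names (NodeDown, PodNotReady, DiskFillingUp), the node label, and the receiver names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical receiver names, mirroring the severity routing described above.
ROUTES = {"critical": "pagerduty", "warning": "slack", "info": "email"}


@dataclass
class Alert:
    name: str
    severity: str
    labels: dict = field(default_factory=dict)


def inhibit(alerts, source_name="NodeDown", target_name="PodNotReady", equal="node"):
    """Drop target alerts whose `equal` label matches a firing source alert."""
    down = {a.labels.get(equal) for a in alerts if a.name == source_name}
    return [
        a for a in alerts
        if not (a.name == target_name and a.labels.get(equal) in down)
    ]


def route(alerts):
    """Group the alerts that survive inhibition by receiver."""
    routed = {}
    for alert in inhibit(alerts):
        routed.setdefault(ROUTES[alert.severity], []).append(alert.name)
    return routed


# One node goes down and takes 50 pods with it; another node has unrelated issues.
alerts = [Alert("NodeDown", "critical", {"node": "node-a"})]
alerts += [
    Alert("PodNotReady", "critical", {"node": "node-a", "pod": f"api-{i}"})
    for i in range(50)
]
alerts.append(Alert("PodNotReady", "critical", {"node": "node-b", "pod": "web-0"}))
alerts.append(Alert("DiskFillingUp", "warning", {"node": "node-b"}))

print(route(alerts))
```

Of the 53 alerts that fire, only three reach a receiver: the node alert, the one pod alert on the healthy node, and the disk warning. The 50 pod alerts on the dead node are suppressed because they share a node label with the NodeDown alert.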

Logs: Fluent Bit + CloudWatch / Loki

For log collection, I deploy Fluent Bit as a DaemonSet on every node. It's lightweight (typically 10-15 MB of memory per node in my deployments), fast, and supports dozens of output plugins. Logs flow from containers to Fluent Bit, get enriched with Kubernetes metadata (pod name, namespace, labels), and are shipped to the storage backend.

For AWS-native stacks, I use CloudWatch Logs with Insights for querying. For multi-cloud or cost-sensitive environments, Grafana Loki offers a Prometheus-like experience for logs at a fraction of the cost - it indexes only labels, not the full log content, which dramatically reduces storage costs.

The most important log engineering decision: structured logging. Every application should emit JSON logs with consistent fields - timestamp, level, service, trace_id, message. This makes querying, filtering, and correlating across services trivial. Unstructured logs are effectively unsearchable at scale.
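
As an illustration of those fields, here is a minimal sketch using only Python's standard logging module. The JsonFormatter class, the build_logger helper, and the "checkout" service name are hypothetical; a real service would more likely pull trace_id from its tracing context than pass it by hand.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # trace_id is supplied per call via `extra`; None if the caller omits it.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("checkout")  # hypothetical service name
logger.info("payment authorized", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Because every line is a JSON object with the same keys, a query such as "all ERROR lines for this trace_id across every service" becomes a field filter rather than a regular expression.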

Traces: OpenTelemetry

OpenTelemetry has become the de facto standard for distributed tracing (and increasingly for metrics and logs too). I instrument applications with the OpenTelemetry SDK and its auto-instrumentation libraries, which hook into HTTP clients, database drivers, and messaging libraries with minimal code changes.

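Traces are collected by the OpenTelemetry Collector, which runs as a sidecar or DaemonSet. The Collector handles sampling, batching, and exporting to backends like Jaeger, AWS X-Ray, or Grafana Tempo. I use tail-based sampling to keep 100% of error traces and slow traces while sampling 10% of healthy traces - this gives full visibility into problems without the storage cost of capturing everything. One caveat: a tail-based decision needs every span of a trace to reach the same Collector instance, so the sampling step usually belongs in a central gateway tier that the sidecars or node agents forward to.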
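
The sampling policy itself is simple to state in code. The sketch below is not the Collector's implementation; it is a small Python model of the decision, and the Span type, the keep_trace function, and the 500 ms definition of "slow" are assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, slow_ms=500.0, healthy_rate=0.10, rng=random.random):
    """Decide whether to keep a completed trace.

    Runs after all spans of the trace have arrived, which is what makes
    the decision "tail-based": it can look at the outcome, not just the start.
    """
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True  # keep every trace that contains an error
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True  # keep every trace with a slow span
    return rng() < healthy_rate  # keep a fraction of healthy traces


traces = {
    "t1": [Span("t1", 40.0), Span("t1", 12.0, is_error=True)],
    "t2": [Span("t2", 900.0)],
    "t3": [Span("t3", 35.0)],
}
kept = [tid for tid, spans in traces.items() if keep_trace(spans)]
print(kept)  # always contains t1 and t2; t3 appears about 10% of the time
```

A head-based sampler could not implement this policy, because at the moment the request starts nobody knows yet whether it will fail or be slow.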

SLOs: Tying It All Together

Metrics, logs, and traces are raw ingredients. Service Level Objectives (SLOs) turn them into actionable reliability targets. I define SLOs for every production service based on two key indicators:

Availability SLI: Percentage of requests that return a non-5xx response. Target: 99.95% (allows ~22 minutes of downtime per month).
Latency SLI: Percentage of requests completed within the target latency. Target: 99% of requests under 200ms.
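
The downtime figure follows directly from the target. As a quick check, assuming a 30-day window (the function name is just for illustration):

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full outage that an availability target tolerates per window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


for target in (99.0, 99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min per 30 days")
```

For 99.95% this gives 21.6 minutes, which is where the rounded ~22 minutes comes from. Each extra nine cuts the budget by a factor of ten, which is worth keeping in mind before promising one.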

I build error budget dashboards in Grafana that show how much of the error budget has been consumed over the current window. When the error budget is healthy, teams ship features aggressively. When the budget is nearly exhausted, the team shifts focus to reliability work. This replaces subjective arguments about "should we ship or stabilize?" with data-driven decisions.
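
The number behind such a dashboard is a simple ratio. Here is a rough sketch, assuming a request-based availability SLO; the function names and the 90% threshold for switching to reliability work are hypothetical, not a policy stated above.

```python
def error_budget_consumed(total_requests, failed_requests, slo=0.9995):
    """Fraction of the error budget used in the window (1.0 = fully spent)."""
    if total_requests == 0:
        return 0.0
    allowed_failures = total_requests * (1 - slo)
    return failed_requests / allowed_failures


def release_posture(consumed, freeze_at=0.9):
    """Map budget consumption to a (hypothetical) team policy."""
    return "focus on reliability" if consumed >= freeze_at else "ship features"


# 10M requests at 99.95% allow 5,000 failures; 3,000 have occurred so far.
consumed = error_budget_consumed(total_requests=10_000_000, failed_requests=3_000)
print(f"{consumed:.0%} of budget used -> {release_posture(consumed)}")
```

In this example 60% of the budget is gone, so under the assumed threshold the team would keep shipping.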

Alerting Philosophy

Most alerting is broken because it's based on internal causes (high CPU, a restarted pod) rather than user impact. I follow this hierarchy:

Page-worthy (PagerDuty): SLO burn rate exceeds threshold - real users are being affected right now.
Urgent (Slack): Leading indicators suggest an SLO breach is likely within hours - disk filling up, certificate expiring soon.
Informational (email/dashboard): Trends that need attention this sprint - increasing latency p99, growing queue depth.
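
For the page-worthy tier, "burn rate" has a precise meaning: how many times faster than budget-neutral the service is currently failing. The sketch below shows the arithmetic in Python. The two-window check and the 14.4 threshold are assumptions borrowed from common multi-window burn-rate practice rather than values stated in this article; 14.4 corresponds to spending 2% of a 30-day budget in one hour.

```python
def burn_rate(error_ratio, slo=0.9995):
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1 - slo)


def should_page(long_window_ratio, short_window_ratio, slo=0.9995, threshold=14.4):
    """Page only if both a long and a short window exceed the threshold.

    The long window (e.g. 1 hour) shows the problem is significant;
    the short window (e.g. 5 minutes) shows it is still happening.
    """
    return (
        burn_rate(long_window_ratio, slo) >= threshold
        and burn_rate(short_window_ratio, slo) >= threshold
    )


# 1% of requests failing over the last hour, 1.2% over the last 5 minutes.
print(should_page(long_window_ratio=0.01, short_window_ratio=0.012))

# Same hour, but the last 5 minutes have recovered to 0.1% errors.
print(should_page(long_window_ratio=0.01, short_window_ratio=0.001))
```

The first case pages, because both windows burn at roughly 20x or more. The second does not, because the short window shows the incident has already subsided and waking someone up would change nothing.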

The goal is that every page results in a meaningful action. If an alert fires regularly and gets ignored, it's either misconfigured or not worth alerting on. I review alert fatigue metrics monthly and aggressively prune noisy alerts.

Getting Started

You don't need to build all of this at once. Start with metrics and dashboards (week 1), add structured logging (week 2), implement SLOs for your most critical service (week 3), and add tracing (week 4). Each layer compounds the value of the others. A trace ID that appears in both your logs and your traces transforms a 2-hour debugging session into a 5-minute investigation.