Observability and Monitoring as First-Class Engineering Work

There is a particular sentence that almost guarantees trouble down the line:

“We’ll add monitoring later.”

In isolation, it sounds harmless. In context, it’s often a euphemism for: we’re about to ship something we can’t see, can’t measure, and don’t fully understand once it’s in the wild.

If you ship code or configuration without observability, you are not just “moving fast.” You are pushing unknown risk into production. The system might work. It might not. Worse, it might half-work in ways that are painful for customers and invisible to you until someone escalates.

High-reliability teams treat observability and monitoring as part of the product, not accessories. The question isn’t “Do we have dashboards?” It’s: Can we confidently answer the questions that matter when something goes wrong—or before it goes wrong?

That requires a mindset shift:

Monitoring is not a dashboard you build; it’s a contract between the system and the operators.

The system agrees to emit certain signals, including metrics, logs, traces, events, and network telemetry that accurately reflect its internal state and the user’s experience. In return, operators agree to watch those signals, define meaningful thresholds and alerts, and use them to take action.

Without that contract, everyone is guessing.

Instrumentation as Part of the API, Not a Postscript

In most codebases, observability is sprinkled in as an afterthought: some log lines here, a metric or two there, maybe a trace if someone remembers. You end up with noisy logs, half-instrumented paths, and dashboards that don’t quite match reality.

Treating observability as first-class means designing instrumentation alongside APIs and data models.

When you define a new service, route, or protocol, you ask:

  • What are the core events this component should emit? (successes, failures, retries, state changes)

  • What are the key metrics that describe its health and performance? (latency, error rates, throughput, saturation, queue depths, resource usage)

  • How will requests be traced end-to-end through this component and beyond? (trace IDs, correlation IDs, span structure)

  • What needs to be logged, in structured form, for debugging and auditability? (who did what, when, with which parameters, and what happened)

These questions show up in the design doc and the code review, not months later in a “monitoring initiative.”
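To make the “core events, structured and correlated” part concrete, here is a minimal sketch of a designed-in event envelope. The component name, field set, and `emit_event` helper are illustrative assumptions, not a prescribed schema; the point is that the shape of the events is decided up front, not improvised per log line.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("payments")
logging.basicConfig(level=logging.INFO)


def emit_event(event: str, correlation_id: str, **fields) -> None:
    """Emit one structured event; every event shares the same envelope."""
    record = {
        "ts": time.time(),
        "event": event,                     # e.g. "transaction.retried"
        "correlation_id": correlation_id,   # propagated across services
        **fields,
    }
    logger.info(json.dumps(record))


# The same envelope covers successes, failures, retries, and state changes,
# so the logs stay queryable instead of free-form.
correlation_id = str(uuid.uuid4())
emit_event("transaction.submitted", correlation_id, amount_cents=1299, currency="USD")
emit_event("transaction.retried", correlation_id, attempt=2, reason="provider_timeout")
```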

For example, when you introduce a new API for submitting payment transactions, you don’t just specify the request and response JSON. You also specify:

  • A counter for the number of transactions processed, broken down by outcome (success, declined, error).

  • Latency histograms for processing time, with clear SLO targets (e.g., “99% under 200ms”).

  • A trace that spans from the frontend through all internal services involved in the transaction, so you can see where time is spent.

  • Logs that capture key identifiers and decision points (e.g., risk checks, downstream calls, external provider responses) while respecting privacy and compliance constraints.
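Sketched in code, that specification might look something like the following. It assumes `prometheus_client` and the OpenTelemetry API as the instrumentation libraries; the metric names, label values, and bucket boundaries are assumptions, with a 0.2-second bucket edge chosen so the “99% under 200ms” target can be read straight off the histogram. The `process` function is a hypothetical stand-in for the real business logic.

```python
from prometheus_client import Counter, Histogram
from opentelemetry import trace

# Counter broken down by outcome, as specified above.
TRANSACTIONS = Counter(
    "payment_transactions_total",
    "Payment transactions processed, by outcome",
    ["outcome"],  # success | declined | error
)

# Latency histogram; the 0.2s bucket edge lets the "99% under 200ms" SLO
# be read directly from the recorded distribution.
PROCESSING_TIME = Histogram(
    "payment_processing_seconds",
    "End-to-end processing time for one payment transaction",
    buckets=(0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0),
)

tracer = trace.get_tracer("payments.api")


def process(request: dict) -> dict:
    """Hypothetical stand-in for risk checks and downstream provider calls."""
    return {"approved": True}


def submit_transaction(request: dict) -> dict:
    # One span per submission; downstream services attach child spans via
    # context propagation, which is what makes the trace end-to-end.
    with tracer.start_as_current_span("payments.submit") as span:
        with PROCESSING_TIME.time():
            span.set_attribute("payment.currency", request["currency"])
            outcome = "error"
            try:
                result = process(request)
                outcome = "success" if result["approved"] else "declined"
                return result
            finally:
                # Every exit path, including exceptions, is counted.
                TRANSACTIONS.labels(outcome=outcome).inc()
                span.set_attribute("payment.outcome", outcome)
```

The detail worth arguing about in review is exactly this kind of thing: which labels exist, where the bucket edges sit, and which attributes ride on the span.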

On the network side, when you design a new fabric or routing policy, you don’t stop at “flows are forwarded correctly.” You define:

  • Telemetry that tracks link utilization, loss, jitter, and packet drops.

  • Per-class or per-tenant QoS counters.

  • Route convergence metrics: how long it takes for changes to propagate.

  • Health signals for control-plane and data-plane components, from routing processes to line cards.

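As a sketch, that telemetry surface might be declared along these lines. The metric and label names are illustrative, and in a real fabric the values would come from streaming telemetry or device counters rather than application code; what matters is that the series and their breakdowns are defined as part of the design.

```python
from prometheus_client import Counter, Gauge, Histogram

LINK_UTILIZATION = Gauge(
    "fabric_link_utilization_ratio",
    "Link utilization as a fraction of capacity",
    ["device", "interface"],
)
PACKET_DROPS = Counter(
    "fabric_packet_drops_total",
    "Packets dropped, by device, interface, and reason",
    ["device", "interface", "reason"],
)
QOS_PACKETS = Counter(
    "fabric_qos_packets_total",
    "Forwarded packets per QoS class (or tenant)",
    ["device", "interface", "traffic_class"],
)
ROUTE_CONVERGENCE = Histogram(
    "fabric_route_convergence_seconds",
    "Time for a routing change to propagate across the fabric",
    buckets=(0.1, 0.5, 1.0, 5.0, 15.0, 60.0),
)
COMPONENT_HEALTHY = Gauge(
    "fabric_component_healthy",
    "1 if the component reports healthy, 0 otherwise",
    ["device", "component"],  # e.g. routing process, line card
)

# Example updates, as a collector might emit them:
LINK_UTILIZATION.labels(device="leaf1", interface="Ethernet1").set(0.42)
PACKET_DROPS.labels(device="leaf1", interface="Ethernet1", reason="qos_tail_drop").inc(17)
ROUTE_CONVERGENCE.observe(2.3)
COMPONENT_HEALTHY.labels(device="spine2", component="bgp").set(1)
```
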
If you can’t describe how you’ll know this system is healthy, and how you’ll know it’s drifting toward unhealthy, you’re not done designing it.
