Sorry, this isn't a typical email: it's a deep dive into a topic I've wanted to cover in this newsletter for a long time, and it took a lot of effort to express clearly. I've split it into two parts because email services like Gmail might cut it off, so be sure to read it online and check both parts!

1) So, the story begins…

It starts the way most painful incidents start: quietly, politely, and with a change that looks harmless enough to approve without a second thought.

The ticket is boring in the way “routine hygiene” is boring. A network team is rolling out a small, standardized update across a fleet. Maybe it’s a BGP community tag that needs to be appended for a new traffic-engineering classification. Maybe it’s a harmless-seeming knob like tightening an inbound policy, normalizing an interface description format, or enabling a telemetry sensor path that was missing from a handful of boxes.

Nothing in the diff suggests drama. The change has been applied in labs. It’s been canaried in one low-risk site. The unit tests passed. The pipeline marked the build green. The campaign is scheduled during a quiet window.

And then the campaign begins.

At first, it looks like success. Devices start reporting “in progress,” then “complete.” The deployment queue drains at a comfortable pace. The dashboard shows the clean green rhythm every automation engineer loves: compile, validate, stage, push, confirm. In the NOC, the graph lines are flat. No alarms, no sudden shifts. The change is almost boring enough to forget (and we love that!).
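That stage sequence can be sketched as a toy loop (hypothetical names throughout, not any real tool's API). The point to notice: nothing in the loop consults the network's health between devices, so every device ends up "complete" regardless of what the change actually did.

```python
from dataclasses import dataclass

# Illustrative sketch only: stage names come from the text above,
# everything else is made up for the example.
STAGES = ["compile", "validate", "stage", "push", "confirm"]

@dataclass
class Device:
    name: str
    status: str = "pending"

def run_campaign(devices):
    """Drive every device through each stage in order.

    There is no health-check feedback between devices: the loop
    marks everything 'complete' as long as each stage returns.
    """
    for device in devices:
        for stage in STAGES:
            device.status = f"{stage}: in progress"
            # ... real work (render, diff, push) would happen here ...
        device.status = "complete"
    return devices

fleet = [Device("leaf-1"), Device("leaf-2")]
run_campaign(fleet)
print([d.status for d in fleet])  # ['complete', 'complete']
```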

That’s the moment the real system reveals itself.

Somewhere in the chain, far from the change request and far from the comforting checkmarks, one subtle defect slips through. It isn’t a catastrophic bug like “wipe all configs” or “shutdown all ports.” Those are loud, obvious, and get caught.

This defect is the kind that shows up only when it touches the messy edges of reality.

  • A template variable is slightly wrong: not wrong enough to break compilation, just wrong enough to produce the wrong community on a subset of devices where the variable resolves differently.

  • Inventory is stale: a device was upgraded, but the source of truth still believes it runs an older NOS version and supports a feature it no longer supports, or vice versa.

  • The capability inference logic misclassifies a hardware variant: a leaf that looks like the other leafs isn’t actually the same ASIC profile, so a generated knob becomes a no-op, or worse, it changes behavior in a way that was never modeled.
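The first failure mode is worth making concrete. Here's a minimal sketch (all names hypothetical) of a template that always "compiles": when a stale device-level variable shadows the fleet-wide one, the render succeeds and produces syntactically valid config, just with the wrong community.

```python
# Hypothetical variable layers: a fleet-wide default and per-device overrides.
FLEET_VARS = {"te_community": "65000:100"}           # intended value
DEVICE_VARS = {
    "leaf-1": {},                                     # resolves the fleet-wide var
    "leaf-2": {"te_community": "65000:10"},           # stale local override
}

TEMPLATE = "set community {te_community} additive"

def render(device):
    # Merge order decides which value wins -- that's the subtle part.
    resolved = {**FLEET_VARS, **DEVICE_VARS[device]}
    return TEMPLATE.format(**resolved)

print(render("leaf-1"))  # set community 65000:100 additive  (correct)
print(render("leaf-2"))  # set community 65000:10 additive   (wrong, still valid)
```

Both renders pass any syntax check; only the second one quietly reclassifies traffic.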

Nothing explodes immediately. That’s the cruel part.

The campaign continues to push. The pipeline continues to do what it was designed to do: take intent, render it, deploy it, and converge reality. It does not “feel” uncertainty. It does not hesitate. It does not get nervous because a graph twitched. It just… executes.

And then signals start to appear, not as a single, clean failure, but as a pattern that’s easy to misread if you’ve never been burned at scale.

A handful of sessions flap in one region. A transit interface starts taking a slightly different path than expected. A route-map that should be logically identical now has a different ordering on a subset of devices due to a tiny difference in how the platform compiled the list. A policy meant to tag one class of prefixes now tags another. A very small percentage of customers see a new latency spike. An internal service starts failing health checks in one Availability Zone. The first alert is not “network down.” The first alert is “something smells off.”

In a traditional world, where changes happen box-by-box, where the human operator is the rate limiter, this is where the incident would slow down. Someone would stop and investigate. Someone would feel the discomfort and pause before touching the next router.

But automation does not have discomfort.

Automation is throughput. It does the same thing, everywhere, very fast.

A fleet-wide campaign isn’t about one person making 10 changes. It’s an assembly line stamping out the same change thousands of times. It is consistency, speed, and scale, until it’s consistency, speed, and scale applied to a defect.

That’s when the blast radius stops being a device.
It becomes a function.
It becomes “everything that matches the selector.”

Because that’s how modern automation works. We don’t target devices by walking through racks. We target them by attributes: region, role, model, tag, and intent group. The very thing that makes automation powerful, its ability to act on classes of infrastructure, also makes failures propagate across those same classes.
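Attribute-based targeting can be sketched in a few lines (the inventory shape is hypothetical). One selector, one change request, and the whole matching class becomes the blast radius.

```python
# Illustrative inventory: devices described by attributes, not by name.
INVENTORY = [
    {"name": "leaf-1",  "region": "us-east", "role": "leaf",  "model": "X1"},
    {"name": "leaf-2",  "region": "us-east", "role": "leaf",  "model": "X2"},
    {"name": "spine-1", "region": "us-east", "role": "spine", "model": "S1"},
    {"name": "leaf-9",  "region": "eu-west", "role": "leaf",  "model": "X1"},
]

def select(selector):
    """Return every device whose attributes match all selector keys."""
    return [d["name"] for d in INVENTORY
            if all(d.get(k) == v for k, v in selector.items())]

# A defect in the rendered change hits everything the selector matches.
print(select({"region": "us-east", "role": "leaf"}))  # ['leaf-1', 'leaf-2']
print(select({"model": "X1"}))                        # ['leaf-1', 'leaf-9']
```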

  • If a human types the wrong command on one router, you get one bad router.

  • If a pipeline renders the wrong intent for a role, you get a bad role across an entire region.

  • If the source of truth is wrong, your automation doesn’t just drift: it converges confidently to the wrong state.
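The last bullet is the sneakiest, so here's a toy reconciliation loop (hypothetical data shapes) that trusts the source of truth unconditionally. When the SoT entry is wrong, the loop doesn't drift away from correctness; it actively "repairs" a correct device into the wrong state.

```python
# The SoT records the wrong community; the device is actually correct today.
SOURCE_OF_TRUTH = {"leaf-1": {"community": "65000:10"}}    # stale / wrong intent
ACTUAL_STATE    = {"leaf-1": {"community": "65000:100"}}   # correct live state

def reconcile(sot, actual):
    """Converge actual state toward the SoT -- even when the SoT is wrong."""
    actions = []
    for device, desired in sot.items():
        if actual.get(device) != desired:
            actions.append(f"{device}: set {desired}")
            actual[device] = dict(desired)  # confidently applies the bad intent
    return actions

actions = reconcile(SOURCE_OF_TRUTH, ACTUAL_STATE)
print(actions)  # one "correction" that breaks a healthy device
```

The loop has no notion of "maybe my model is wrong"; any disagreement between model and reality is, by definition, reality's fault.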

And the system does not stop to ask whether the world still agrees with its model.

Now the incident has a different texture.

