This article is divided into three parts. Be sure to read the first and second parts, which are published separately.

7. Designing atomic deployments and rollbacks for state consistency

So far, we’ve mostly talked about detecting inconsistent states.

But a huge amount of inconsistency is created by the way we apply change in the first place. If your deployment model is “spray some CLI here and there and hope for the best”, you’re basically manufacturing drift as a service.

The antidote is to treat network changes the way serious software systems treat database transactions:

Either the change happens as a coherent unit, or it doesn’t happen at all.
No half-applied policies. No “I’ll come back to that other POP later”. No mystery boxes stuck in between.

That’s what atomic deployments and rollbacks are about.
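
To make the "all or nothing" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `Device` is an in-memory stand-in for a router, `fail_on_apply` only exists to simulate a failure, and `deploy_atomically` is an illustrative name, not a real library call. It assumes that restoring a snapshot always succeeds, which a real system cannot take for granted.

```python
from dataclasses import dataclass


class ApplyError(Exception):
    """Raised when a device fails to apply a change."""


@dataclass
class Device:
    # Hypothetical in-memory stand-in for a router. A real implementation
    # would talk to the device through its management interface.
    name: str
    config: str = "policy-v1"
    fail_on_apply: bool = False  # test knob to simulate a failed push

    def apply(self, new_config: str) -> None:
        if self.fail_on_apply:
            raise ApplyError(f"{self.name}: apply failed")
        self.config = new_config


def deploy_atomically(devices: list[Device], new_config: str) -> bool:
    """Apply new_config to every device, or leave every device unchanged."""
    snapshots: dict[str, str] = {}  # device name -> config before the change
    touched: list[Device] = []      # devices that already accepted the change
    try:
        for dev in devices:
            snapshots[dev.name] = dev.config
            dev.apply(new_config)
            touched.append(dev)
    except ApplyError:
        # One device failed: restore every device we already changed,
        # in reverse order, so nothing is left half-deployed.
        for dev in reversed(touched):
            dev.config = snapshots[dev.name]
        return False
    return True


if __name__ == "__main__":
    fleet = [Device("r1"), Device("r2", fail_on_apply=True), Device("r3")]
    print(deploy_atomically(fleet, "policy-v2"))  # the push to r2 fails
    print([d.config for d in fleet])              # every device is back on its old config
```

The point of the sketch is the shape of the control flow, not the transport: the change is recorded, attempted everywhere in scope, and either fully kept or fully undone.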

7.1. Why piecemeal changes are a drift factory

Traditional network operations grew up on the CLI:

  • You SSH into one router and paste a few lines.

  • Then you SSH into another and paste a slightly different variant.

  • Someone else does the same in a different time zone.

  • At the end, you declare that you have “rolled out the change.”

It’s easy to see how this goes wrong:

  • You forget one device in scope.

  • You mistype a policy name on a single box.

  • You run out of maintenance window halfway through and stop at 70% coverage.

  • A connection dies mid-paste; half the stanza is applied, while the other half is not.

From the network’s perspective, you’ve created multiple versions of reality:

  • Some devices have the new policy.

  • Some have the old one.

  • Some have a hybrid of both.
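
The three populations above can be counted with a simple census. The sketch below is illustrative only: the policy lines, device names, and the `classify` and `census` helpers are made up for the example, and a real fleet would be compared on normalized configuration rather than raw strings.

```python
from collections import Counter

# Hypothetical policy stanzas, represented as ordered tuples of config lines.
OLD_POLICY = ("set local-preference 100", "set community 65000:10")
NEW_POLICY = ("set local-preference 200", "set community 65000:20")


def classify(lines: tuple[str, ...]) -> str:
    """Label a device's policy as 'new', 'old', or 'hybrid'."""
    if lines == NEW_POLICY:
        return "new"
    if lines == OLD_POLICY:
        return "old"
    # Anything else (half-applied stanza, typo, leftover line) is a hybrid.
    return "hybrid"


def census(fleet: dict[str, tuple[str, ...]]) -> Counter:
    """Count how many devices fall into each version of reality."""
    return Counter(classify(lines) for lines in fleet.values())


if __name__ == "__main__":
    fleet = {
        "r1": NEW_POLICY,
        "r2": OLD_POLICY,
        # Connection died mid-paste: first line new, second line old.
        "r3": ("set local-preference 200", "set community 65000:10"),
    }
    print(census(fleet))
```

Any result with more than one non-zero bucket means the fleet is running more than one version of the policy at the same time.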

At a small scale, you can sometimes get away with this because traffic doesn’t always hit the “weird” devices. At hyperscale:

  • There is always traffic hitting the weird devices.

  • There is always a change in flight somewhere.

  • The probability that some nodes are partially updated is essentially 1.

If you design deployments as a series of ad-hoc, per-device edits, inconsistent state isn’t an edge case. It’s an inevitable outcome.
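
A quick back-of-the-envelope calculation shows why scale makes this inevitable. The sketch assumes each per-device edit goes wrong independently with the same small probability; both the independence and the 0.1% figure are assumptions picked for illustration, not measured values.

```python
def p_any_partial(n_devices: int, p_fail: float) -> float:
    """Probability that at least one of n independent per-device edits goes wrong.

    Uses the complement rule: 1 - P(every edit succeeds).
    """
    return 1.0 - (1.0 - p_fail) ** n_devices


if __name__ == "__main__":
    P_FAIL = 0.001  # assumed chance that a single manual edit ends up partial
    for n in (10, 1_000, 10_000):
        print(f"{n:>6} devices: {p_any_partial(n, P_FAIL):.5f}")
```

Under these assumptions the risk is roughly 1% for a 10-device change but above 99.99% for a 10,000-device change. The per-device error rate did not move; only the number of devices did.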
