1. Case study: BGP RR cluster split and hidden divergence
Let’s stitch together everything about “Intent” vs. “State”, and what happens when they actually diverge (drift), with a concrete story.
We’ll take a failure mode that’s common in large networks, rarely obvious, and deeply annoying to debug:
A BGP route reflector (RR) cluster silently splits into two “islands of truth.”
Half of it sees the world one way, half sees it another. Nobody notices, until customers do.
1.1. The scenario: a subtle control-plane partition
You run a global backbone with multiple regions and POPs. Your BGP design looks roughly like this:
Each region has its own RR cluster.
Edge routers (PEs) in that region peer with the local RRs.
RR clusters are themselves connected across a control-plane mesh, so global routes and customer VPNs propagate.
One day, a change goes in to “tighten” control-plane ACLs on infrastructure interfaces.
The intent is good: restrict which boxes can send BGP traffic to the RRs.
The change is rolled out gradually, region by region.
In one RR cluster, though, the ACL is wrong on a subset of nodes:
Half the RRs still accept BGP sessions and updates from Region SA.
The other half silently drops those packets due to the new ACL.
The sessions between RRs remain up where they’re allowed, but some paths no longer propagate across the full cluster.
From the perspective of the network:
Some RRs see customer:C prefixes from Region SA.
Others do not.
Downstream edge routers connected to the “unlucky” RRs never learn those routes.
The cluster is effectively partitioned in terms of information, even though most individual sessions look “up”.
1.2. Without consistency tooling: a slow-burn incident
Here’s how this plays out in a traditional environment.
First signals
A few POPs in another region (say, EU-west) start seeing tickets:
“Users in South America cannot reach service S intermittently.”
Your own synthetic probes from EU-west to customer C’s prefixes mostly pass, but some fail, usually when they hit certain POPs.
Nothing is obviously on fire:
BGP sessions show as “Established” on both RRs and PEs.
Global prefix counts are roughly the same; no massive withdrawal storms.
Core CPU and memory look fine.
Debugging begins
Engineers jump onto the bridge and start the usual dance:
show bgp vpnv4 unicast vrf customer-C on an edge router in EU-west: the prefixes from Region SA are not present when peering via RR R2.
But they are present on another edge that happens to peer via RR R1.
Weird.
They check the RRs:
On RR R1: show bgp shows all customer:C prefixes from Region SA as expected.
On RR R2: no such routes exist for those VPNs.
The configs on R1 and R2 look identical:
Same neighbors.
Same route-policies.
Same BGP address families.
Everyone is squinting at what appear to be twins:
“Configs are the same, sessions are up, but one RR just doesn’t have the routes.”
Time passes
While the team bisects:
Some POPs continue to route traffic correctly (the ones depending on RRs that see the routes).
Others keep dropping traffic for customer:C (those tied to the “blind” RRs).
Every new traceroute feels arbitrary: from some places it goes through, from others it fails.
Eventually, someone notices a subtle difference in ACLs on control-plane interfaces:
On RRs R1 and R3, the ACL permits BGP from Region SA’s RRs.
On RRs R2 and R4, the ACL denies it.
They manually fix the ACLs, reset some BGP sessions, and convergence slowly restores.
The incident closes with a vague root cause, like:
“Incorrect ACL on RRs prevented route propagation for customer C from Region SA to some POPs.”
But behind that sentence were:
Long minutes (or hours) of manual diffing and eyeballing.
A lot of “but it looks the same” confusion.
Customers in certain geographies suffering intermittent failures the whole time.
1.3. With consistency tooling: divergence is the primary signal, not a surprise
Now replay the exact same failure, but assume you’ve built the machinery we’ve been discussing:
A state comparison engine.
Multi-layer validation and state-aware observability.
Reconciliation loops for safe auto-correction.
Detection: control-plane inconsistency shows up early
As the faulty ACL rolls out:
RRs R2 and R4 stop learning customer:C prefixes from Region SA.
RRs R1 and R3 continue to learn and reflect them.
Your RIB-to-RIB consistency checks are running continuously for customer:C (a tagged critical prefix set):
They compare BGP tables for that tenant across all RRs in the cluster.
They expect all RRs in the same cluster to have the same view of customer:C.
Within minutes, you get this:
bgp_rr_consistency{cluster="eu-core", tenant="customer:C"} = 0.5
Meaning: only 50% of RRs in that cluster have a consistent view of those prefixes.
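To make that concrete, here is a minimal sketch of how such a RIB-to-RIB check could compute that metric. It assumes you already have a collector that returns the per-RR prefix set for a tenant; the device names, prefixes, and label values below are purely illustrative.

```python
def rr_consistency(rr_views: dict[str, set[str]]) -> tuple[float, set[str]]:
    """Fraction of RRs whose view matches the union of all prefixes known anywhere
    in the cluster, plus the set of RRs that are missing routes."""
    full_view = set().union(*rr_views.values())
    divergent = {rr for rr, view in rr_views.items() if view != full_view}
    return 1 - len(divergent) / len(rr_views), divergent

# Example: R2 and R4 never learned the Region SA prefixes for customer:C.
views = {
    "R1": {"10.20.0.0/24", "10.20.1.0/24"},
    "R2": set(),
    "R3": {"10.20.0.0/24", "10.20.1.0/24"},
    "R4": set(),
}
fraction, divergent = rr_consistency(views)
print(f'bgp_rr_consistency{{cluster="eu-core", tenant="customer:C"}} = {fraction}')  # 0.5
print("divergent RRs:", sorted(divergent))  # ['R2', 'R4']
```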
In parallel, your prefix propagation SLO metrics notice something odd:
Normally, prefixes from Region SA for customer:C appear on all EU RRs within 10 seconds.
Now, they are present only on R1 and R3.
Propagation time to R2 and R4 is effectively “infinite”.
The observability stack raises a targeted alert:
Alert: BGP state inconsistent between RR cluster members for customer:C prefixes.
Cluster: eu-core
Tenant: customer:C
Healthy RRs: R1, R3
Stale/missing RRs: R2, R4
First detected: 14:03:21 UTC
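A hedged sketch of the propagation-SLO side of this, assuming a hypothetical poller that records when each RR first learned a prefix. The 10-second threshold mirrors the expectation above, and every label and value in the alert payload is illustrative.

```python
import time

PROPAGATION_SLO_SECONDS = 10  # prefixes should reach every RR in the cluster within 10s

def propagation_alert(tracked_prefixes, first_seen, announced_at, now):
    """first_seen[rr][prefix] = when that prefix appeared in the RR's RIB (epoch seconds).
    Emits a targeted alert once the SLO window has expired and some RRs still lack routes."""
    if now - announced_at <= PROPAGATION_SLO_SECONDS:
        return None  # still inside the SLO window, nothing to report yet
    healthy, stale = [], []
    for rr, seen in first_seen.items():
        on_time = all(
            prefix in seen and seen[prefix] - announced_at <= PROPAGATION_SLO_SECONDS
            for prefix in tracked_prefixes
        )
        (healthy if on_time else stale).append(rr)
    if not stale:
        return None
    return {
        "alert": "BGP state inconsistent between RR cluster members",
        "cluster": "eu-core",            # labels are illustrative
        "tenant": "customer:C",
        "healthy_rrs": sorted(healthy),  # e.g. ["R1", "R3"]
        "stale_rrs": sorted(stale),      # e.g. ["R2", "R4"]
        "first_detected": time.strftime("%H:%M:%S UTC", time.gmtime(now)),
    }

# Example: one tracked prefix, R2 and R4 never learned it, SLO window long expired.
print(propagation_alert(
    tracked_prefixes={"10.20.0.0/24"},
    first_seen={"R1": {"10.20.0.0/24": 3.0}, "R2": {}, "R3": {"10.20.0.0/24": 4.0}, "R4": {}},
    announced_at=0.0, now=61.0,
))
```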
At this point, customers may just be starting to notice problems, but your internal systems have already localized the inconsistency to specific RRs and a specific prefix set.
Validation: it’s not just control-plane noise
Your multi-layer validation kicks in:
RIB on R1 and R3: prefixes from Region SA are present and correct.
RIB on R2 and R4: prefixes missing.
FIB checks on downstream edges show that nodes peering via R2/R4 lack forwarding entries for those routes.
Synthetic probes from EU POPs that depend on R2/R4 for customer:C fail; probes via R1/R3 pass.
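One way to aggregate those per-layer results is to run them in order, from the control plane down to observation, and let the first failure localize the fault. A minimal sketch, with placeholder callables standing in for the real per-layer collectors:

```python
from typing import Callable

# Checks ordered from control plane down to observation; the first failure in the
# chain tells you at which layer the stories start to disagree.
LAYER_ORDER = ["rib_has_routes", "fib_programmed", "probes_pass"]

def localize_failure(checks: dict[str, Callable[[], bool]]):
    """Run every layer check and report the first one that fails, so the verdict is
    'broken starting at the control plane' rather than just 'broken somewhere'."""
    results = {layer: checks[layer]() for layer in LAYER_ORDER}
    first_bad = next((layer for layer in LAYER_ORDER if not results[layer]), None)
    return results, first_bad

# Illustrative wiring for an edge that peers via R2: the RIB is already missing the
# customer:C routes, so the FIB and probe failures are consequences, not causes.
results, first_bad = localize_failure({
    "rib_has_routes": lambda: False,  # Region SA prefixes absent on R2/R4
    "fib_programmed": lambda: False,  # no forwarding entries on edges behind them
    "probes_pass":    lambda: False,  # synthetic probes through those edges fail
})
print(first_bad)  # rib_has_routes -> a control-plane scope problem
```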
Your dashboards now show a very specific picture:
The heatmap cell for eu-core/control_alignment/tenant=customer:C is red.
Probes for eu-west → customer:C flows via rr=R2/R4 are failing.
Other tenants and regions are green.
You’re not guessing at this point. You know:
It’s a control-plane scope problem.
It’s RR-localized.
It’s tenant-specific.
1.4. Reconciliation and response: fixing the root cause with intent
With the problem localized, your reconciliation pipeline picks up the diffs:
For RRs R2 and R4, the state comparison engine shows:
Config vs intent mismatch on control-plane ACLs.
The intent model says: “Permit BGP from these Region SA RR IPs.”
Running config denies those IPs.
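A minimal sketch of what that config-vs-intent diff could look like, assuming ACL intent is modeled as the set of peer addresses permitted to send BGP to the RR; the addresses below are illustrative.

```python
def acl_drift(intended_permits: set[str], running_permits: set[str]) -> dict[str, set[str]]:
    """Diff intended vs running control-plane ACL permits for BGP peers.
    'missing' = peers the intent permits but the box now denies;
    'extra'   = permits on the box that intent never asked for."""
    return {
        "missing": intended_permits - running_permits,
        "extra": running_permits - intended_permits,
    }

# Intent version v17: permit BGP from the Region SA route reflectors.
intent_v17 = {"192.0.2.11", "192.0.2.12"}   # illustrative RR loopback addresses
running_r2 = set()                          # the tightened ACL dropped both entries
running_r1 = {"192.0.2.11", "192.0.2.12"}

print("R2:", acl_drift(intent_v17, running_r2))  # both peers missing -> drift
print("R1:", acl_drift(intent_v17, running_r1))  # clean: matches intent
```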
Reconciliation logic runs through its decision tree:
Is the deviation unambiguous drift?
Yes: intent is clear, other cluster members (R1, R3) match it.
No active change requests are open that would explain a partial ACL change.
Is the correction low-risk and well-understood?
Yes: updating ACLs on RR control-plane interfaces is a known-safe pattern when done carefully.
We have a predefined remediation playbook and prior successful runs.
Are there any systemic signals that suggest we should not auto-correct?
No: CPU, memory healthy; no flood of unrelated anomalies.
Given these answers, the loop chooses auto-correction:
It generates a reconciliation change:
“Restore control-plane ACLs on R2 and R4 to match intent version v17.”
It applies the ACL fix atomically on each RR:
Pre-check: verify existing config and BGP sessions.
Apply ACL update in a single transaction or simulated-atomic operation.
Post-check: confirm BGP sessions from Region SA RRs are now established and exchanging updates.
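Sketched below, with all device interaction hidden behind placeholder callables, is roughly what that decision tree plus guarded apply path could look like. It illustrates the shape of the logic, not a drop-in implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Deviation:
    device: str
    unambiguous_drift: bool       # intent is clear and peers (R1, R3) already match it
    open_change_request: bool     # an in-flight change could explain the difference
    known_safe_remediation: bool  # a playbook exists, with prior successful runs
    systemic_warning: bool        # CPU/memory pressure, floods of unrelated anomalies

def decide(d: Deviation) -> str:
    """Walk the decision tree: auto-correct only when the drift is unambiguous,
    the fix is well-understood, and nothing hints at a wider problem."""
    if not d.unambiguous_drift or d.open_change_request:
        return "escalate"
    if not d.known_safe_remediation or d.systemic_warning:
        return "escalate"
    return "auto-correct"

def reconcile(d: Deviation,
              pre_check: Callable[[], bool],
              apply_fix: Callable[[], None],
              post_check: Callable[[], bool]) -> str:
    """Guarded apply: verify the device first, push the ACL fix, then confirm the
    Region SA sessions are established and exchanging updates again."""
    if decide(d) != "auto-correct":
        return "escalated to a human"
    if not pre_check():
        return "aborted: pre-check failed"
    apply_fix()
    return "recovered" if post_check() else "post-check failed: rolling back"

# Illustrative run for R2 (the callables are placeholders for real device interaction).
r2 = Deviation("R2", unambiguous_drift=True, open_change_request=False,
               known_safe_remediation=True, systemic_warning=False)
print(reconcile(r2, pre_check=lambda: True, apply_fix=lambda: None, post_check=lambda: True))
```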
As routes start flowing again:
The RIB-to-RIB consistency metric converges towards 1.0.
Prefix propagation times for customer:C go back to normal.
Probes from all EU POPs to the Region SA customer endpoints turn green.
Asymmetry indices between EU and SA for customer:C flows drop back to baseline.
The system then:
Logs the entire reconciliation action with before/after snapshots.
Tags it with the original divergence alert.
Optionally opens a follow-up ticket or PR to review why the ACL was misconfigured in the first place (e.g., a flawed deployment spec?).
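The audit entry itself does not have to be fancy; a structured record tied back to the originating alert is enough. A minimal sketch, with illustrative field names and values:

```python
import json
import time

def audit_record(alert_id: str, devices: list[str], intent_version: str,
                 before: dict, after: dict) -> str:
    """A structured log entry that ties the remediation back to the divergence alert,
    with before/after snapshots so a postmortem can replay exactly what changed."""
    return json.dumps({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "action": "reconcile-control-plane-acl",
        "alert_id": alert_id,
        "devices": devices,
        "intent_version": intent_version,
        "before": before,
        "after": after,
        "follow_up": "review why the ACL deviated from intent in the first place",
    }, indent=2)

print(audit_record(
    alert_id="bgp-rr-divergence-eu-core-customer-C",   # illustrative identifier
    devices=["R2", "R4"],
    intent_version="v17",
    before={"acl": "denies Region SA RR peers"},
    after={"acl": "permits Region SA RR peers"},
))
```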
On the incident bridge, the conversation looks very different:
Instead of “what on earth is going on?”, it’s “we had a divergence in RR views for customer C; automation corrected control-plane ACL drift on R2/R4; monitoring confirms full recovery.”
1.5. What changed, really?
The underlying bug was the same in both stories:
A control-plane ACL change partitioned part of an RR cluster.
What changed was how fast and how precisely you noticed, and how structured your response was.
Without the consistency machinery:
Time-to-detect: long. You only become aware once customers shout loudly enough, from enough places.
Time-to-localize: longer. You burn cycles diffing configs and guessing at root causes.
Time-to-repair: depends on the skill and intuition of whoever happens to be on call.
Blast radius: high. Multiple POPs and customers experience intermittent breakage.
With state comparison, validation, observability, and reconciliation:
Time-to-detect: minutes, often before a major external impact.
Time-to-localize: near-instant, down to specific RRs and tenants.
Time-to-repair: scripted and consistent, with guardrails and validation.
Blast radius: constrained; you catch the issue while it’s still mostly a control-plane inconsistency, not a multi-hour outage.
And perhaps most importantly:
You get a clean audit trail:
“At 14:03:21, BGP RR cluster eu-core diverged on customer:C prefixes due to control-plane ACL drift on R2/R4.
At 14:05:12, reconciliation updated ACLs to match intent v17.
At 14:06:00, RIB and probe consistency restored to baseline.”
That’s exactly the kind of story you want to tell in postmortems, architecture reviews, and to yourself the next time you design a change:
Inconsistent state will happen. The question is whether your system notices and corrects it quickly, or leaves you chasing ghosts in the middle of the night.
2. Anti-patterns: how teams invite inconsistent state
By now, it should be clear that inconsistent state is hard enough to deal with even when you’re doing everything right. The brutal reality is that a lot of teams unintentionally optimize for inconsistency through their habits and culture.
If you recognize yourself in any of these patterns, that’s not an indictment. It’s a signal. These are exactly the places where small changes in behavior translate into massive improvements in stability.
Let’s walk through a few of the most common anti-patterns.
2.1. “Just fix it on the box” culture
You know this one.
An incident is in progress. Dashboards are red, chat is noisy, and customers are complaining. Someone with access and experience logs into a router, types a few commands, and the problem goes away.
There’s no PR.
There’s no ticket linking the change to the incident.
There’s maybe a shell history on that one box, until the next reload.
In the moment, it feels heroic. In reality, you’ve just created a local truth that never becomes global truth:
The source-of-truth still believes the old policy is active.
The deployment system will happily re-push that old state during the next rollout.
State comparison engines (if they exist at all) see the box as “drifting” but nobody knows why.
Over time, the network becomes a patchwork of “known special cases” that live only in people’s heads:
“Oh, don’t touch R7, it has that one tweak for customer C.”
“That edge in APAC is weird; last time we changed it, things broke.”
This is how tribal knowledge replaces intent, and nothing erodes consistency faster than that.
2.2. Source-of-truth in name only
The opposite failure looks more disciplined on paper, but is just as deadly in practice.
You have:
A Git repo with all your “golden configs”.
A policy DSL to define tenants, prefixes, and peers.
Maybe even a nice UI on top of it.
But in reality:
Rollouts happen directly from people’s laptops, bypassing the repo.
Hotfixes never get backfilled.
DNS, IPAM, and inventory systems disagree on what devices and addresses actually exist.
The repo is updated in large, infrequent “sync” commits that nobody really reviews.
The result is a stale, idealized version of the network that lags weeks behind reality.
At that point:
You can’t trust diffs between Git and running configs, because you don’t know which side is telling the truth.
Automation built on top of the repo is flying blind, making assumptions that haven’t been true in production for a long time.
Any attempt at reconciliation becomes “which fiction do we prefer today?”
A source-of-truth that isn’t actually the source, or the truth, is worse than none at all. It gives the whole organization a false sense of safety while inconsistency accumulates unchecked underneath.
2.3. Single-plane thinking
Another common anti-pattern is treating one layer as the reality and ignoring the rest:
Teams that focus exclusively on config diffs: if the text matches, they assume the network is healthy.
Teams that obsess over control-plane tables (RIB, LSDB, BGP) and assume FIB and ASICs will “just follow”.
Teams that rely only on data-plane probes and never ask whether the intent or control-plane view is correct.
Each of these is incomplete:
Config-only thinking misses FIB programming bugs, partial ACL installs, ECMP skew, and vendor-specific quirks.
Control-plane-only thinking misses local blackholes, corrupted adjacencies, or incorrect policy enforcement at the edges.
Probe-only thinking catches pain but not cause; it tells you that things are broken but not why your network’s internal stories disagree.
Single-plane thinking is how you end up with incidents like:
“All configs match, all protocols are up, but 3% of flows still die when they hash onto one leg of an ECMP group.”
“Traceroute looks insane from some POPs, but the RIB looks perfect; we never checked the ASIC.”
“Synthetic tests fail, but nobody can explain which layer is lying, so we just keep rebooting boxes.”
If you never connect intent → config → control-plane → data-plane → observation, then an inconsistent state will always be a surprise, usually discovered when customers complain.
2.4. Fire-and-forget automation
The last anti-pattern looks sophisticated until you look closely.
Teams invest in automation:
Ansible playbooks, Netconf/gNMI pipelines, custom controllers.
“One click” jobs that push config to hundreds or thousands of devices.
Cron jobs or CI flows that apply changes on a schedule.
Then they stop there.
There are no:
Pre-change state checks.
In-flight guardrails or canary stages.
Post-change validation beyond “did the command return success?”
Telemetry hooks that know which version is deployed where.
These systems push state, but never listen.
When something goes wrong:
The pipeline happily reports “SUCCESS” because all RPC calls returned 200 OK.
Devices that silently reject parts of the config go unnoticed.
Partial deployments leave pockets of old and new policy in the same topology.
There is no systematic way to tie an SLO breach back to a specific change, version, or batch.
Fire-and-forget automation scales your ability to create an inconsistent state. It does nothing to help you detect, understand, or correct it.
It’s like replacing a hand saw with a chainsaw, then using it blindfolded.
These anti-patterns are not exotic. They are the default in many orgs, especially those that “grew into” large networks rather than designing for scale from day one.
The upside is: they’re also some of the easiest levers to pull if you want to get serious about consistency:
Stop treating box-local hotfixes as harmless one-offs; always reconcile them with intent, or explicitly reject them.
Make your source-of-truth actually authoritative and keep it within hours, not weeks, of production reality.
Force yourself to look across planes when debugging or designing; if your solution lives entirely in one layer, assume you’re missing something.
Refuse to ship automation that doesn’t have feedback loops: no more pushes without pre/post checks and telemetry.
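As a rough illustration of that last lever, here is what a push wrapper with a feedback loop could look like. The per-device push and check functions are placeholders for whatever your pipeline actually calls, and the canary logic is deliberately simplified.

```python
from typing import Callable, Sequence

def push_with_feedback(devices: Sequence[str],
                       push: Callable[[str], bool],
                       pre_check: Callable[[str], bool],
                       post_check: Callable[[str], bool],
                       record_version: Callable[[str], None],
                       canary_count: int = 2) -> dict[str, str]:
    """The opposite of fire-and-forget: validate state before and after every push,
    record which device now runs which version, and halt early if canaries misbehave."""
    status: dict[str, str] = {}
    for i, device in enumerate(devices):
        if not pre_check(device):
            status[device] = "skipped: pre-check failed"
        elif not push(device):
            status[device] = "failed: push rejected"
        elif not post_check(device):
            status[device] = "failed: post-check"   # the RPC said OK, the state disagrees
        else:
            record_version(device)                  # telemetry knows what runs where
            status[device] = "ok"
        # Stop the rollout if any device in the canary stage did not come back healthy.
        if i < canary_count and status[device] != "ok":
            status["rollout"] = "halted at canary stage"
            break
    return status

# Example: the second canary fails its post-check, so the rollout stops there.
print(push_with_feedback(
    devices=["edge1", "edge2", "edge3"],
    push=lambda d: True,
    pre_check=lambda d: True,
    post_check=lambda d: d != "edge2",
    record_version=lambda d: None,
))
```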
Earlier, I showed what “good” looks like: comparison engines, multi-layer validation, consistency-aware observability, and controlled reconciliation.
Avoiding these anti-patterns is how you make sure all of that work actually sticks, and that your network’s stories about reality stay aligned, even as the system grows and changes.
Leonardo Furtado

