Most teams learn “high availability” by protecting what is obviously critical: spines, leafs, border routers, firewalls, links, power feeds, and route reflectors. Then they automate the network, often successfully, and quietly introduce a new single point of failure: the automation and control systems that now define how the network is operated.

That blind spot only shows up when the pressure is real.

A fabric controller becomes unreachable right in the middle of a production incident. A change must be made now. The on-call engineer asks two deceptively simple questions:

  1. “Can I touch the devices directly if the controller is down?”

  2. “If the controller is down, do I lose observability too?”

This is the deeper truth: modern networks are not just routers and switches. They’re a coupled system of data plane + control plane + management/automation plane + observability plane. If you want genuine resilience, you must engineer HA across all of them, especially the software stack that governs intent, change, and evidence.

1) The first mental model: “management + assurance plane” is not the traffic path

Let’s start with the cleanest case, because it clarifies the entire problem space:

Juniper Apstra is management and assurance, not the traffic path. If Apstra becomes unavailable, the spine–leaf fabric keeps forwarding with the last applied configuration; what you lose is orchestration, intent validation, and Apstra’s own analytics/UX, unless you designed around that.

That model is not unique to Apstra. It’s the right baseline for most automation systems:

  • They are not forwarding traffic.

  • They are not running your BGP sessions (except in architectures where controllers participate directly in the control plane, which is a different class).

  • They do define how changes happen, how drift is detected, and how evidence is presented.

So the network may stay “up,” but your ability to operate it may degrade sharply.

And in 2025, “operate” is often the difference between a 10-minute incident and a 6-hour outage.

2) Controllers are not all equal: Apstra vs Cisco ACI vs in-house stacks

It’s tempting to lump all controllers into one box and call it “the orchestrator.” But the failure blast radius depends on what the system actually controls.

Apstra (intent + assurance for EVPN/VXLAN fabrics)

Apstra is designed to manage intent, generate configurations, deploy them, and continuously validate state against intent. It can be run as a VM cluster with a controller node and worker nodes to scale resources and improve resilience.

Cisco ACI (policy controller for a tightly-coupled fabric)

Cisco APIC is the point of policy configuration and the place where operational stats/telemetry are processed for visibility and health; APIC is deployed as a cluster, with Cisco recommending a minimum of three controllers for HA and scale.

ACI is still not your traffic path either, but it is far more operationally central because the entire fabric’s policy model lives there. You can often keep forwarding during APIC outages, but your ability to make safe policy changes, troubleshoot via the platform, or restore certain workflows can become constrained.

In-house automation stacks (the hyperscaler pattern)

In hyperscalers, the “controller” is rarely a single product. It’s usually a distributed automation platform made of:

  • source-of-truth (inventory + intent model)

  • compilers/renderers (intent → vendor config)

  • deployers (safe rollout engines)

  • validators (drift, invariants, SLO guards)

  • telemetry ingestion and correlation

  • audit/event systems

  • workflow orchestration and remediation engines

These are software services, meaning they inherit every HA problem of modern distributed systems: leader election, databases, queues, consistency, idempotency, backpressure, partial failures, and degraded modes.
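As a toy illustration of why these services inherit distributed-systems concerns like idempotency, consider the smallest possible version of the chain above: a declarative model, a renderer, and an idempotent deployer. All names here (Intent, render_config, deploy) are invented for illustration, not any product's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    """Declarative source-of-truth entry for one device (illustrative fields)."""
    hostname: str
    asn: int
    loopback: str

def render_config(intent: Intent) -> str:
    """Compiler/renderer step: intent -> vendor-style config lines."""
    return "\n".join([
        f"set system host-name {intent.hostname}",
        f"set routing-options autonomous-system {intent.asn}",
        f"set interfaces lo0 unit 0 family inet address {intent.loopback}/32",
    ])

def deploy(rendered: str, running: str) -> tuple[str, bool]:
    """Idempotent deployer: only push when rendered differs from running state."""
    if rendered == running:
        return running, False   # no-op: device already matches intent
    return rendered, True       # converge the device toward intent
```

The idempotency check in deploy() is the seed of the whole problem: a system built this way will keep converging devices toward intent, which is exactly why unreconciled manual changes get overwritten.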

That is why hyperscalers treat network automation as a first-class production platform, owned with SRE rigor rather than “a set of scripts.”

3) The two questions that matter during an outage

Question A: “Can I act on an individual switch if the orchestrator is unavailable?”

Technically: yes. Operationally: it depends on how mature your system is.

In a traditional enterprise (or most organizations), a switch is still a switch: you can reach it via OOB/console/SSH and change configuration. The danger is not whether you can, it’s what happens after the controller returns:

  • Any manual change becomes out-of-band drift relative to your declared intent.

  • When the controller performs a full push, it may overwrite the emergency fix.

  • Even worse: you may “fix” the symptom locally but violate global invariants the controller normally enforces (security tags, route policies, EVPN symmetry, etc.).

This is why the real contract matters:

Break-glass may be necessary to restore service quickly, but it must be reconciled, either by incorporating the change back into the controller’s intent/templates or by reverting the device back to intended state.

That reconciliation step is the difference between “incident mitigation” and “planting the seed of the next incident.”
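That reconciliation step can start with something as simple as diffing running state against declared intent. A minimal sketch using Python's standard library (function names are illustrative):

```python
import difflib

def detect_drift(intended: str, running: str) -> list[str]:
    """Return unified-diff lines showing how running config departs from intent.

    An empty result means the device matches intent; anything else is
    out-of-band drift that must be reconciled (adopted into intent or reverted).
    """
    return list(difflib.unified_diff(
        intended.splitlines(),
        running.splitlines(),
        fromfile="intended",
        tofile="running",
        lineterm="",
    ))
```

A check like this, run before any full push, is what turns "the controller overwrote my emergency fix" into a deliberate decision instead of an accident.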

Question B: “Does telemetry/observability stop when the orchestrator is unavailable?”

If your observability depends on the orchestrator’s data stores and UI, then yes: that platform’s dashboards and validations go dark.

But the devices do not stop emitting signals.

Switches continue generating syslog, responding to SNMP, exporting streaming telemetry, producing gNMI/NETCONF state, exporting flow logs, and so on. What changes is whether you still have a separate ingestion + storage + alerting pipeline that continues functioning when the controller is down.

So the right question becomes:

Did you build observability as an independent plane, or did you outsource “evidence” to the controller UI?

The world doesn’t stop when the controller does, but you should not depend exclusively on Apstra (or any controller) as your only source of monitoring; you need an independent stack for logs, metrics, and alerting.

4) Break-glass: what it really means in hyperscalers vs traditional orgs

Break-glass is widely misunderstood. Many teams treat it as “SSH in and do what you must.”

That is not break-glass. That’s just “normal operations with heroics.”

Traditional organizations: break-glass is usually a trust model

In most environments, the organization relies on policy and culture:

  • “Don’t make changes outside the controller.”

  • “If you do, tell the team.”

  • “Document it later.”

That’s a human trust chain. It works… until it doesn’t.

And when it fails, it fails in the worst possible way: silent drift, untracked configuration deltas, compliance gaps, and the next automated push “healing” the network back into the broken state.

In most companies, preventing unauthorized change is largely trust-based because the environment does not technically prevent direct SSH changes.

Hyperscalers: break-glass is a mechanism, not a suggestion

At hyperscale, the philosophy flips:

  • Normal operation = declarative automation

  • Manual change = forbidden by default

  • Break-glass = time-bound, auditable exception enforced by systems

The exact approval chain varies by company and system, but the consistent pattern is that access is gated by tooling (ticket metadata, limited scope, strong identity, full audit). The reason is simple: in a fully automated environment, out-of-band changes are not merely risky; they are incompatible with the operating model, because automation will continuously attempt to converge state to intent.

Hyperscalers also evolve mechanisms to prevent automation from blindly overwriting emergency changes until humans reconcile them, which we describe as “override prevention mechanisms” and “Andon cords.”

This matters because it’s the missing engineering layer in traditional orgs: they have “a controller,” but they don’t have the socio-technical safety system around it.

5) Designing HA for automation systems: the real blueprint

If you take nothing else from this article, take this:

The HA goal is not “controller uptime.” The HA goal is “operational continuity under partial failure.”

That means designing the automation ecosystem so that:

  • The network keeps forwarding (obvious)

  • The ability to restore service remains available

  • The ability to observe and prove the state remains available

  • The ability to make controlled changes remains available, or you have a safe degraded mode

Let’s translate that into concrete architecture.

Layer 1 — Controller HA (product-level)

Cisco ACI APIC: run it as a proper cluster (minimum three controllers), size it for transaction rate, and treat APIC health as a first-class SLO because it is the policy brain.

Juniper Apstra: use VM clustering where appropriate, adding worker nodes to scale/absorb load (especially for analytics/offbox agents/IBA).

But clustering alone isn’t enough.

Layer 2 — Data durability (backup + restore that actually works)

Controllers are software. Software fails in messy ways: database corruption, broken upgrades, disk fills, container failures, expired certs, bad NIC bonding, you name it.

So you need tested backup/restore and off-box copies:

  • Apstra supports database backup via aos_backup, producing dated snapshots in /var/lib/aos/snapshot/... and you can restore from those snapshots.

  • Apstra’s database runs in containers and lives under /var/lib/aos/db, which highlights an operational reality: if the VM is gone and your snapshots were only local, your “backup” died with the server.

A mature design has:

  • Routine backups exported off the appliance/VM

  • Periodic restore drills (not optional)

  • A warm spare or rapid redeploy path (VM templates, IaC, golden images)

  • Clear RPO/RTO targets aligned to business risk
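The “exported off the appliance” and retention pieces are mostly plumbing. As a sketch, here is the retention decision over dated snapshot directory names such as those aos_backup produces; the naming scheme and retention count used here are illustrative assumptions:

```python
def snapshots_to_keep(names: list[str], retain: int = 7) -> list[str]:
    """Newest `retain` snapshots; ISO-style dated names sort chronologically."""
    return sorted(names, reverse=True)[:retain]

def snapshots_to_prune(names: list[str], retain: int = 7) -> list[str]:
    """Everything not kept is a candidate for pruning -- but only after the
    kept set has been verified to exist off-box (that check is out of scope
    for this sketch, and is exactly what restore drills exercise)."""
    keep = set(snapshots_to_keep(names, retain))
    return [n for n in names if n not in keep]
```

The hard part is not this logic; it is scheduling the export, verifying the off-box copies, and actually rehearsing the restore.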

Layer 3 — Independent observability (because “controller down” must not mean “blind”)

This is where many organizations lose the plot.

If the controller provides assurance dashboards, that’s great, but you still need a parallel pipeline:

  • syslog → log platform (index, search, alert)

  • streaming telemetry → TSDB + dashboards + alert rules

  • SNMP/gNMI polling where appropriate → NMS/metrics

  • infrastructure health for the controller itself (VM/host/storage/network)

The point is not duplicating UIs. The point is preserving signal continuity when the controller UI is unavailable.

Layer 4 — Safe degraded operations (what you do during the outage)

This is your “break-glass engineering.”

You need three things, and they must be designed before the outage:

  1. Independent access path: real OOB/console that does not depend on the controller network. (If your controller assumes OOB to reach devices, you must ensure OOB survives controller failure.)

  2. Auditable session + change capture: command logging, config snapshots, pre/post diffs, identity, timestamps, ticket references.

  3. Reconciliation workflow: a defined, practiced way to bring the controller’s intent back in sync after the incident.

The reconciliation workflow is the “closing of the loop” that prevents the next automation run from undoing your emergency fix.
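The “auditable session + change capture” piece can be as small as one structured record per break-glass session. A hypothetical sketch, where every field name is an assumption rather than a real tool's schema:

```python
import hashlib
import time

def config_hash(config: str) -> str:
    """Short, stable fingerprint of a config snapshot."""
    return hashlib.sha256(config.encode()).hexdigest()[:12]

def break_glass_record(user: str, ticket: str, pre: str, post: str) -> dict:
    """Evidence captured at the moment of an emergency change: identity,
    ticket reference, timestamp, and pre/post config fingerprints."""
    return {
        "user": user,
        "ticket": ticket,
        "timestamp": time.time(),
        "pre_hash": config_hash(pre),
        "post_hash": config_hash(post),
        "changed": config_hash(pre) != config_hash(post),
        "reconciled": False,  # flipped only once intent is updated or the device reverted
    }
```

The "reconciled" flag is the loop-closer: an incident is not done while any break-glass record still carries reconciled = False.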

6) The practical “break-glass contract” that actually works

Here’s what a grown-up break-glass process looks like in reality, whether you run Apstra, ACI, or an in-house stack.

Step 1: Declare that break-glass is an exception path, not an alternate workflow

If engineers routinely bypass the controller because it’s “faster,” your automation platform will never become reliable. You’ll be permanently stuck in drift and fear.

Step 2: Put a guardrail around it

Even if you cannot implement hyperscaler-grade gating immediately, you can move beyond the trust model:

  • Limit who can break-glass (RBAC)

  • Require a ticket ID (enforced in access workflows if possible)

  • Record sessions (terminal recording)

  • Require pre/post config snapshots

  • Require a short “what/why/when” entry while the context is fresh

Whatever tooling you have, make the operational contract explicit: define who can do it, how changes are recorded, and how reconciliation happens.
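Even a thin software gate beats the pure trust model. A hypothetical sketch of the RBAC-plus-ticket check; the role names and ticket pattern are invented for illustration:

```python
import re

# Illustrative guardrail: before a break-glass session opens, require an
# authorized role AND a well-formed ticket reference. Neither set below is
# from any real product; adapt to your own identity and ticketing systems.
TICKET_RE = re.compile(r"^(INC|CHG)-\d{4,}$")
BREAK_GLASS_ROLES = {"neteng-oncall", "network-sre"}

def may_break_glass(role: str, ticket: str) -> bool:
    """Gate check: authorized role and a syntactically valid ticket ID."""
    return role in BREAK_GLASS_ROLES and bool(TICKET_RE.match(ticket))
```

A check like this does not stop a determined insider, but it converts "anyone can SSH in silently" into "every emergency session carries an identity and a ticket," which is what reconciliation and audit need.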

Step 3: Reconcile as a first-class incident task

Not “later.” Not “when we have time.”

Reconciliation is part of mitigation completion:

  • If the fix is correct and should persist → encode it into intent/templates (controller) or into the source-of-truth pipeline (in-house).

  • If the fix was a tactical band-aid → revert it explicitly once the controller returns.

  • If the controller would overwrite it → use a controlled “freeze/andon cord” mode until reconciliation is complete.

This is how hyperscalers stay sane: they treat the automation platform as the system of record, and everything else is either forbidden or reconciled quickly.

7) A reality check: “HA of automation” is really about SRE discipline

Once you accept that automation is production software, the conclusions become obvious:

  • Define SLOs for automation availability (UI/API), deployment pipeline health, validation freshness, telemetry lag, and drift detection latency.

  • Build error budgets and release gates.

  • Use canaries for automation changes, too (your platform is capable of breaking the network at scale).

  • Treat “automation-caused incidents” as seriously as hardware outages, because at scale, they are.

Add rigorous postmortems and engineering follow-through, so the same class of failure does not require break-glass again.
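Error budgets for the automation platform work exactly like any other SRE error budget. A sketch of the arithmetic; the SLO target and counts below are example numbers:

```python
def error_budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget left for the window (assumes slo < 1.0
    and total > 0). Negative means the budget is blown and release gates
    for automation changes should tighten."""
    allowed_bad = (1.0 - slo) * total   # failures the SLO permits this window
    actual_bad = total - good           # failures actually observed
    return 1.0 - actual_bad / allowed_bad
```

For example, at a 99.5% deploy-pipeline availability SLO over 10,000 pipeline runs, 50 failures are budgeted; 10 observed failures leaves 80% of the budget.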

That is the real maturity curve:

  • From “controller as a tool”

  • To “controller as the operating system of the network”

  • To “network automation as a production platform with HA, DR, and governance”

The fundamental truth

Break-glass is not a shameful workaround, and it’s not a sign that automation failed.

Break-glass is proof that you’re operating a real system under real failure.

The difference between hyperscalers and traditional organizations is not that hyperscalers never need break-glass. It’s that hyperscalers engineer break-glass as a controlled, auditable, time-bound safety valve, and then they build the platform improvements so they need it less over time.

If you want high availability in modern networks, you must design HA not only for switches and links, but for:

  • The controller cluster

  • The controller database + backups + restore drills

  • Independent observability

  • And the operational continuity path when the controller is down

Because in the incident that matters most, the question won’t be “is the fabric forwarding?”

It will be: can you still see, act, and recover (safely) when your automation brain is unavailable?

That's the whole point!

Leonardo Furtado
