Most teams learn “high availability” by protecting what is obviously critical: spines, leafs, border routers, firewalls, links, power feeds, and route reflectors. Then they automate the network, often successfully, and quietly introduce a new single point of failure: the automation and control systems that now define how the network is operated.
That blind spot only shows up when the pressure is real.
A fabric controller becomes unreachable right in the middle of a production incident. A change must be made now. The on-call engineer asks two deceptively simple questions:
“Can I touch the devices directly if the controller is down?”
“If the controller is down, do I lose observability too?”
This is the deeper truth: modern networks are not just routers and switches. They’re a coupled system of data plane + control plane + management/automation plane + observability plane. If you want genuine resilience, you must engineer HA across all of them, especially the software stack that governs intent, change, and evidence.

