1. When “Automation” Is Just Faster Manual Work
Let’s start with an uncomfortable truth.
Most “network automation” out there today is just manual work… at 10× speed. 😅
You take the same commands you used to type on the CLI, you wrap them in an Ansible playbook or a Python script, and suddenly you can touch 50 devices instead of 5. From a distance, it looks like progress. Internally, it still feels like firefighting, just... more efficiently.
And in a lot of organizations, that’s where the journey stops.
Engineers proudly say, “We’ve automated BGP policy changes!” but what they really mean is “We’ve taught a script to SSH into 80 routers and paste config.” No real pre-checks. No proper blast radius control. No systematic rollback. No telemetry-driven validation. If the script or the human makes a mistake, the only difference is that the outage is now faster and bigger.
Meanwhile, the teams at hyperscale companies figured this out more than a decade ago. They tried the “clever script” phase. They paid for it with outages, late nights, and very expensive lessons. Then they moved on.
The point isn’t that you need to run at their scale. Most networks never will. The point is that they’ve already done the dangerous experiments for you. They’ve shown what works and what doesn’t when you push automation to its limits.
So instead of obsessing over tools (Ansible vs. Nornir vs. vendor X’s magic platform), it’s time to ask a better question:
“What if we treated network automation as a software system, not as just a bag of scripts?”
That’s the mindset shift that changes everything.
2. What Hyperscale Learned the Hard Way
Fifteen or so years ago, big web-scale companies were right where many operators are today.
A few passionate engineers started writing scripts to get rid of repetitive tasks: pushing ACLs, updating BGP neighbors, and rotating keys. They wrote Bash, Perl, and Python. They stacked scripts on scripts. They created internal “runbooks” that were basically collections of one-off CLI macros.
And then things started breaking.
A script that worked fine on ten devices suddenly behaved differently on a new hardware generation. A poorly tested mass change pushed a bad route-map everywhere at once. A “quick fix” for one region accidentally impacted traffic globally because no one had a clear picture of how everything was wired together.
They discovered that:
Having many scripts is not the same as having a system.
Speed without guardrails just multiplies the blast radius.
Humans are very bad at reasoning about distributed systems from a collection of ad-hoc tools.
So they pivoted.
Instead of thinking “What else can I script?”, they started asking:
“What is the source of truth for our network?”
“What is the desired state for each service and device?”
“How do we know if reality matches that desired state?”
“How can we safely move from state A to state B, with feedback loops, not blind pushes?”
They built controllers, intent models, health checks, canary deployments, and reconciliation loops. They invested in telemetry and in tools that treat the network as data, not just as a set of CLIs.
And they are still learning and refining.
The key insight for you is simple: you don’t have to repeat their mistakes on your smaller network. You can borrow the patterns, scaled to your reality.
3. The Script Trap: Speed Alone Is a Dead End
Let’s zoom back into a typical ISP, enterprise, or data center environment.
The “automation success story” often looks like this:
There’s a Git repo with dozens or hundreds of playbooks and scripts.
Two or three people understand how they work. Everyone else is afraid to touch them.
The scripts connect to devices directly over SSH or NETCONF and push config templates.
If a playbook fails on device 37 out of 80, the engineer eyeballs the damage and decides what to do next.
Verification is mostly manual: a few pings, a quick show bgp summary, maybe a check on a monitoring dashboard.
On a calm day, this feels fine. “Look, we updated all our PE routers in 10 minutes instead of three hours. Automation rocks.”
But under pressure, during a major migration, a vendor bug, or a crisis, the weaknesses show:
You have no central view of desired vs actual state.
You can’t easily answer “What changed where in the last hour?”
Rolling back is another ad-hoc script… if there is a rollback at all.
If someone makes a bad variable change in a template, you quickly propagate that mistake everywhere.
This is what I mean when I say: you’ve barely gotten started.
You’ve made the keystrokes faster, but you haven’t fundamentally changed the operating model. You’re still imperative rather than declarative: “Do this, then that, in this order, everywhere.” You’re still relying on human judgment, pattern-matching, and luck at exactly the moments when humans are most likely to be tired and stressed.
And from a career perspective, you’ve turned yourself into “the person who owns the magic scripts,” which is fragile. If you burn out or move on, the whole automation tower wobbles.
There is a better way.
4. Observability First: You Can’t Automate What You Can’t See
The foundation of serious network automation isn’t a language or a framework. It’s observability.
If you cannot reliably see:
How your devices are behaving,
How your control plane is behaving, and
How your customers are experiencing the network,
any automation you build is flying blind.
Observability in a modern network spans multiple layers:
Device health: CPU, memory, temperature, interface counters, queue depths.
Control-plane state: BGP sessions, OSPF/IS-IS adjacencies, LSDB and RIB consistency, label distributions.
Data-plane behavior: loss, latency, jitter, path changes, ECMP imbalances.
Flow-level insight: IPFIX/NetFlow, sFlow, sampled traffic, who talks to whom, and how much.
Routing correctness: route leaks, hijacks, unexpected prefixes; often via BMP feeds or RPKI/ASPA validation.
Hyperscale teams treat telemetry not as “nice to have graphs,” but as inputs to automation:
Before a change, they query current health: are SLOs within bounds? Are there existing incidents?
During a rollout, they watch key metrics: loss, latency, error rates, and session flaps. If anything crosses a threshold, the pipeline halts automatically.
After a change, they verify: is the new path actually in use? Did convergence behave as expected? Has capacity shifted as planned?
You can do a lighter version of this on a much smaller network:
Stream basic telemetry (interfaces, CPU, BGP sessions) into a time-series database.
Define simple service-level checks: “Can we reach these key prefixes/URLs with acceptable latency?”
Integrate your automation pipelines with health checks: no pipeline should blast changes if the network is already degraded.
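That last point can be sketched as a small gate function over pre-aggregated health signals. This is a deliberately simplified illustration: the HealthSnapshot fields and the thresholds are assumptions you would replace with queries against your own telemetry store.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    # Illustrative pre-aggregated signals from a telemetry store.
    bgp_sessions_down: int
    packet_loss_pct: float
    open_incidents: int


def change_window_is_safe(snap: HealthSnapshot, max_loss_pct: float = 0.5) -> bool:
    """Gate for a change pipeline: refuse to start if the network is degraded."""
    if snap.open_incidents > 0:
        return False  # never pile a change on top of an active incident
    if snap.bgp_sessions_down > 0:
        return False  # control plane is not clean
    return snap.packet_loss_pct <= max_loss_pct


# A pipeline would call this before touching any device:
# if not change_window_is_safe(fetch_snapshot()): abort("network degraded")
```

The useful part is not the thresholds themselves but the shape: the health check is code that the pipeline consults, not a dashboard a human glances at.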
Without observability, “automating” is just hoping. With observability, automation becomes a controlled experiment.
5. Designing for Failure and Blast Radius Containment
A reliable network is not one that never fails. It’s one that fails predictably and locally.
Many automation stories ignore this. A script that touches every core or PE router in one go feels powerful, but it’s also terrifying. You’ve just made it possible to spread a bad policy or template everywhere in minutes.
Hyperscale engineers think about changes in terms of blast radius:
What’s the smallest scope we can safely change first? A single device? A pod? A region?
If something goes wrong, how contained will it be?
How quickly and automatically can we roll back or route around the damage?
That thinking leads to patterns like:
Canary devices or POPs: deploy new policies or software to one or two nodes first, watch behavior, then expand.
Wave-based rollouts: region by region, or ring by ring, with pause points and automatic halts.
Change budgets: only a certain volume of risky changes allowed in a given window.
Predefined rollback plans: not “we’ll figure it out,” but concrete, encoded procedures in the automation itself.
Imagine you’re changing your edge BGP policy to tighten prefix filters and attach new communities for traffic engineering. A blast-radius-aware pipeline would:
Apply the new policy to one edge router.
Verify that sessions stay up, routes are as expected, and traffic flows as predicted.
Wait for a defined soak period.
Gradually expand to a few more edges in different regions.
Only then, roll out globally.
If at any stage you detect anomalies, such as drops in reachability, unexpected route drops, or session churn, the pipeline stops and either rolls back automatically or pages a human with precise context.
You don’t need hyperscale to do this. You just need discipline and a bit of structure:
Tag devices into groups (pods, regions, roles).
Build your automation to operate per-group, not “all at once.”
Add health checks at each stage.
Suddenly, automation stops being a grenade and becomes a scalpel.
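Sketched in Python, a wave planner over such group tags might look like this. Device names, region tags, and the single-canary-first policy are all illustrative assumptions; use whatever grouping your inventory already has.

```python
from collections import defaultdict
from typing import Dict, Iterator, List


def plan_waves(inventory: Dict[str, str], canary: str) -> Iterator[List[str]]:
    """Yield rollout waves: the canary alone first, then one wave per region.

    `inventory` maps device name -> region tag (hypothetical schema).
    """
    yield [canary]  # wave 0: a single canary device
    by_region: Dict[str, List[str]] = defaultdict(list)
    for name, region in inventory.items():
        if name != canary:
            by_region[region].append(name)
    for region in sorted(by_region):
        # one region per wave; the caller pauses and verifies between waves
        yield sorted(by_region[region])


# for wave in plan_waves(inventory, canary="edge-ams-1"):
#     apply_and_verify(wave)  # halt the whole rollout if any check fails
```

The design choice worth noting: the planner only decides *order and scope*; applying and verifying stay in the pipeline, so halting between waves is the default, not an afterthought.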
6. From Imperative to Declarative: Intent Over Commands
The next big leap is moving from an imperative model (“run these commands”) to a declarative model (“this is the state we want”).
We’re already comfortable with this in other parts of the stack:
Kubernetes: you describe deployments and services; the system converges to that state.
Terraform: you declare infrastructure; Terraform figures out the actions needed to reach that state.
In networking, we’re slowly catching up.
An imperative mindset says:
“For this new customer, I’ll configure interface x, add vlan y, configure vrf z, attach route-target, add policy-map…”
A declarative mindset says:
“Customer ABC needs an L3VPN with these prefixes, this QoS policy, and these export/import rules. Deploy that service.”
Under the hood, you still generate CLI or API calls, but your source of truth is the intent, not the commands.
This has huge benefits:
Idempotency: applying the same intent twice should result in the same state. You’re not accumulating duplicate CLI knobs.
Reviewability: you can diff intent objects, not wall-of-text configs. “We added one new RT” is easier to review than two hundred changed lines.
Portability: the same intent model can render configs for different vendors or OSs. That’s long-term leverage.
Rollback: rolling back means restoring a previous version of intent, not manually crafting inverse commands.
In practice, this often looks like:
Storing service definitions (e.g., VPNs, peering sessions, policies) as structured data: YAML, JSON, or a small database.
Having a rendering engine that takes those definitions and generates vendor-specific configurations per device.
Feeding those generated configs into your execution engine (Ansible, Nornir, custom tools) in a controlled way.
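Here is a minimal, dependency-free sketch of that rendering idea. In practice many teams use Jinja2 here; to keep the example self-contained it uses Python’s stdlib string.Template instead, and every field name and CLI line is an illustrative assumption, not a real schema or guaranteed vendor syntax.

```python
from string import Template

# One service intent, stored as structured data in the source of truth.
# The field names are illustrative, not a standard schema.
l3vpn_intent = {
    "vrf": "CUST-ABC",
    "rd": "65000:100",
    "export_rt": "65000:100",
    "import_rt": "65000:100",
}

# A vendor-flavored template (IOS-XR-like syntax, shown purely as an example).
VRF_TEMPLATE = Template(
    "vrf $vrf\n"
    " rd $rd\n"
    " route-target export $export_rt\n"
    " route-target import $import_rt\n"
)


def render_vrf(intent: dict) -> str:
    """Generate device config from intent; the intent, not the CLI, is authoritative."""
    return VRF_TEMPLATE.substitute(intent)
```

Note that rendering is deterministic: the same intent always produces the same config, which is exactly what makes diffing and rollback tractable.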
Yes, this is more work than a couple of quick scripts. But it’s also the difference between “automation that burns out its author” and “automation that becomes an internal platform.”
7. Reconciliation Loops: Detect Drift, Then Correct
Even with perfect intent models, the real world will drift.
Someone will log into a device and “just change one thing.”
A box will reboot and lose part of its runtime state.
A migration will be half-done when a higher-priority incident intervenes.
If your automation story stops at “we applied the config,” you have no systematic way to notice or fix that drift other than human intuition and occasional audits.
Reconciliation loops close that gap.
At a high level, reconciliation means:
Periodically (or on events), observe the actual state: running-config, routing table, neighbor states, telemetry.
Compare it against the desired state from your source of truth.
For each discrepancy, decide:
Is this intentional? Then update the source of truth.
Is this accidental? Then raise an alert, or automatically correct it.
You can implement reconciliation at different layers:
Config drift: “Does this interface, VRF, or policy match what the template says it should?”
Routing drift: “Are the expected prefixes announced and received with the correct attributes?”
Service drift: “Are all the sites that should be part of VPN X actually reachable as expected?”
Even a small first step, like a nightly job that checks generated configs against show run and sends a report of differences, will uncover surprises.
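That nightly comparison can start as little more than a unified diff. The sketch below assumes you already have the intended (generated) and running configurations as text for each device.

```python
import difflib


def drift_report(device: str, intended: str, running: str) -> str:
    """Return a unified diff between intended and running config.

    An empty string means no drift was found for this device.
    """
    diff = difflib.unified_diff(
        intended.splitlines(keepends=True),
        running.splitlines(keepends=True),
        fromfile=f"{device}/intended",
        tofile=f"{device}/running",
    )
    return "".join(diff)


# A nightly job would loop over devices, collect the non-empty reports,
# and mail or post the summary for humans to triage.
```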
Over time, you can evolve this into:
Real-time drift detection using streaming telemetry or gNMI snapshots.
Automatic remediation for low-risk deviations.
Integration with your change system: no manual edits without a corresponding intent update.
This is where automation stops being “do things faster” and becomes “keep the network healthy.”
8. A Minimal Reference Architecture for Modern Network Automation
Let’s tie these ideas together into something you can actually draw on a whiteboard and build toward.
A minimal, vendor-agnostic network automation architecture usually has these components:
Source of Truth (SoT)
This is where you store:
Inventory: devices, roles, locations, capabilities.
Topology: how things connect logically and physically.
Services and intent: VPNs, peers, policies, tenants, SLOs.
It can be a database, a set of well-structured Git repos, or a dedicated tool, as long as it’s authoritative.
Rendering / Translation Layer
This layer takes intent and transforms it into:
Device-specific configuration snippets.
API payloads for controllers, cloud gateways, and SD-WAN systems.
Think of Jinja2 templates, custom Python renderers, or higher-level tools. The key is that configs are generated, not handcrafted.
Execution Engine
This is the mechanism that actually applies changes:
Ansible, Nornir, custom orchestrators, vendor APIs, controllers.
It runs pipelines: pre-checks → staged rollout → post-checks.
It knows how to talk to devices, but it doesn’t own the business logic.
Observability & Telemetry
All the signals we talked about earlier:
Metrics, logs, flows, BMP, and synthetic probes.
Dashboards and alerts wired to SLOs.
This is not just for humans; pipelines query this during changes.
Reconciliation Engine
A process that:
Regularly compares SoT to real state.
Raises drift alerts or applies corrections.
Sometimes this is tightly coupled with the controller; other times it’s a separate system.
Interfaces for Humans & Other Systems
APIs, CLIs, and maybe a simple UI for requesting services or changes.
Self-service forms for other teams: “create me a new tenant / VRF / peering” that triggers an automated workflow.
In a small organization, all of these may live in a few Python apps and a Git repo. In a hyperscale one, each may be its own platform with teams behind it.
The architecture is the same. The fidelity differs.
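To make the wiring concrete, here is one possible skeleton of a single reconciliation pass that ties the components together. Every callable is a placeholder for the real subsystem (SoT, telemetry, execution engine, alerting); nothing here is a specific product’s API.

```python
from typing import Callable, Dict


def reconcile_once(
    desired: Callable[[], Dict[str, str]],   # Source of Truth: device -> intended config
    observed: Callable[[], Dict[str, str]],  # Observability: device -> actual config
    apply_fix: Callable[[str, str], None],   # Execution engine: push intended config
    alert: Callable[[str], None],            # Human interface: raise drift for triage
    auto_remediate: bool = False,
) -> int:
    """One pass of the reconciliation loop. Returns the number of drifted devices."""
    want, have = desired(), observed()
    drifted = 0
    for device, intended in want.items():
        if have.get(device) != intended:
            drifted += 1
            if auto_remediate:
                apply_fix(device, intended)  # low-risk deviations only
            else:
                alert(f"drift on {device}")
    return drifted
```

In a small shop this whole loop can be one cron job; at hyperscale each argument is a platform. The control flow is the same.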
9. “But We’re Just a Small ISP / Enterprise”—How to Actually Start
At this point, it’s easy to feel overwhelmed. You might be thinking:
“This sounds great, but I have three people, 200 devices, and a day job.”
Fair enough.
That’s why you shouldn’t try to build the palace. Start with a room.
Here’s a realistic adoption path:
Step 1 – Clean up your inventory and basics
Get a single, reliable list of devices, roles, and key attributes. If it’s a CSV in Git to start, fine. The goal is “one place to look,” not elegance.
Step 2 – Wrap your existing scripts in a simple pipeline
Take a high-value, high-risk change (e.g., edge BGP policy). Instead of running your script directly, create a small driver that:
Runs a pre-check: sessions stable, no current major alerts.
Executes on a small, fixed set of devices first.
Checks key metrics and reachability after those devices.
Expands to the next batch only if checks pass.
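Those four bullets can be sketched as a small driver function. The callables are stand-ins for whatever scripts and checks you already have; this is a shape to grow into, not a finished tool.

```python
from typing import Callable, Iterable, List


def run_staged_change(
    batches: Iterable[List[str]],          # e.g. [["canary-1"], ["pe-1", "pe-2"], ...]
    pre_check: Callable[[], bool],         # sessions stable, no major alerts
    apply_change: Callable[[str], None],   # your existing script, per device
    post_check: Callable[[str], bool],     # metrics and reachability look sane
) -> List[str]:
    """Apply a change batch by batch, halting expansion when checks fail.

    Returns the list of devices that were actually changed.
    """
    if not pre_check():
        return []  # network already unhealthy: do nothing
    done: List[str] = []
    for batch in batches:
        for device in batch:
            apply_change(device)
            done.append(device)
        if not all(post_check(d) for d in batch):
            break  # halt expansion; rollback or paging happens here
    return done
```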
You’re already doing better than 90% of shops.
Step 3 – Introduce declarative intent for one use case
Pick something contained: L3VPN provisioning, or WAN edge peers, or data center VLANs. Represent that service as structured data in your SoT. Generate configs from that. Treat manual CLI changes for that service as “bugs” to eliminate.
Step 4 – Add simple reconciliation
Schedule a job that compares generated configs for that use case with what devices are running. Produce a diff report. Investigate differences. Decide case by case whether to update intent or fix the device.
Step 5 – Iterate and expand
Once one use case feels solid, add another. Over time, you’ll cover more of your network with these patterns: predictable workflows, proper checks, and feedback loops.
Every step adds safety and clarity. None of them requires hyperscale budgets.
10. Career Growth: From “Scripter” to Network (Software) Development Engineer
Let’s talk about you for a second.
There’s a huge difference between:
“I write Ansible playbooks that configure BGP on our routers.”
and
“I designed and built our network automation platform. It takes business intent for VPNs and peering, renders vendor-specific configs, rolls them out safely with health checks and staged deployment, and reconciles drift using telemetry.”
The first is valuable, but fragile and very tool-specific.
The second is a career-changing profile.
When you start thinking and talking in terms of:
Source of truth and models,
Blast radius and change safety,
Declarative intent and convergence,
Reconciliation and drift management,
Observability and SLOs,
you step into the territory of Senior/Staff/Principal-level engineering. You’re no longer just pushing buttons faster; you’re designing the system that controls how buttons are pushed.
This is exactly the kind of experience that hyperscale companies, sophisticated ISPs, and serious enterprises look for when they hire for “network software engineer,” “network development engineer,” or “platform engineer” roles.
And even if you never leave your current company, this mindset shifts how leaders see you. You go from “the person who helps with automation” to “the person who is redesigning how we run the network.”
That’s a big leap.
11. Don’t Reinvent the Wheel: Stand on the Shoulders
The core message is simple:
You do not need to rediscover every lesson hyperscalers learned the hard way.
They’ve already shown that:
Script-only approaches plateau quickly and cause outages.
Real automation requires observability, blast-radius awareness, declarative intent, and reconciliation loops.
Treating network automation as a software system, not a bag of tools, is what actually scales in both reliability and human sanity.
Your network might be tiny compared to theirs. That doesn’t matter. Good design principles scale down just as well as they scale up.
You’re not trying to copy their tech stack. You’re borrowing their thinking:
See clearly (observability).
Fail locally (blast radius containment).
Describe what you want, not just how to do it (declarative).
Continuously compare reality to intent (reconciliation).
You won’t build it overnight. You don’t need to. Pick a slice of your network, apply these ideas there, and expand gradually.
That’s how you step up your game, not by worshipping tools, but by designing systems.
In the meantime, next time someone says, “We’re doing network automation now… we have a bunch of Ansible scripts,” smile and ask:
“Cool. What does your source of truth look like? How do you know if the network is actually in the state you wanted after a change?”
That’s where the real conversation begins.
Leonardo Furtado

