This is a long and dense post. Since services like Gmail often clip larger emails, I strongly recommend opening it in your browser instead.

So, Is Your Organization Operating Under, and Praising, a Heroics Culture? It's Time To Change!

In the world of network engineering, heroism is often glorified; engineers who “save the day” during outages are praised for their quick thinking and in-depth institutional knowledge.

However, behind every heroic recovery lies a broken system: undocumented configurations, tribal knowledge silos, and a culture that prioritizes reaction over resilience. And that, my friend, is a huge problem.

I chose to write and publish about this topic because I've been involved in various industries and have dealt with a wide range of complexities over time. I've seen many different work cultures and the value propositions they promote; I've been there, done that.

I've seen this relationship fail on both sides. For businesses, it means constant downtime, operational struggles, technical debt, and unhappy customers. For engineers, heroics are sustainable only for so long before burnout sets in; as their work-life balance deteriorates, they come to regret missing important moments that can't be recaptured.

Based on my extensive background and experience, I have a clear understanding of what works and what doesn't, having played both roles: the hero, and the one challenging that model. I'm confident that this post reflects the reality of many organizations, and the ideas I present here for overcoming these challenges are common practice in hyperscale companies.

Once you've worked for one of these giants, you realize there's no going back: the lessons you learn stay with you, especially when it comes to this mindset I'll be sharing. Once you've mastered operating this way, it's tough to accept working for organizations where tech debt, operational toil, and tribal knowledge drive everything you do.

This article dismantles the myth of the network hero, exploring how it undermines availability, velocity, and team well-being. Through real-world examples and lessons from high-scale operations, we argue for a shift toward systemic engineering practices that foster sustainability, reliability, and scale, without burning out the people who build and operate our networks.

So, enjoy the ride and let your organization learn from it.

Let's get into it.

The Myth of the Network Hero

In network engineering, the hero engineer is a respected yet often overlooked figure. Picture this: it’s 2:36 a.m. on a Saturday, and multiple alerts flood in. A vital data center link has failed, BGP sessions are breaking down, and customer traffic is halted. Complete mayhem. Then, from the digital shadows or “out of nowhere”, the hero appears, someone who has been with the network since its inception.

They log in to the device via a seldom-used, forgotten jump host, run a single CLI command to disable an obsolete policy, and restore network stability.

They receive Slack praise, possibly even a nod from leadership during the Monday morning meeting. The disaster has been averted.

But peel back the applause, and what you’re left with is not resilience. It’s roulette.

Heroics, by their nature, are reactive. They rely on luck, muscle memory, and often a brittle stack of undocumented hacks. Over time, they form a dangerous illusion: that network reliability can be maintained through instinct and experience alone.

In reality, every such save story is a symptom, a flashing red indicator that the system has failed somewhere: in design, documentation, tooling, or team culture.

This myth of the heroic networker is especially pervasive in organizations where institutional memory is concentrated in a few engineers. Their knowledge of the network’s quirks, such as legacy ACLs, routing anomalies, and vendor-specific bugs, is so deep that others hesitate to make changes without their input. It may seem efficient in the short term, but it locks the business into an unsustainable dependency model. It’s not a culture of empowerment; it’s a culture of exception handling, with a human as the exception handler.

Consider a real-world example: a senior network engineer at a global company maintained an internal DNS failover script that rerouted traffic during outages. However, it was not documented or under version control and was only stored on his laptop. When a production outage occurred and he was on vacation, he was unreachable. The script never executed, and the outage lasted 14 hours. The root cause was not hardware failure but organizational fragility, hidden for years by the heroics of a single individual.

Analogously, think of a firefighter in a city where the buildings are designed to catch fire regularly. If that firefighter is fast enough, skilled enough, maybe they can keep the damage to a minimum, most of the time. But is that how we should design cities? Or networks?

True operational excellence in networking isn’t forged in crisis. It’s earned through boring, repeatable, automated processes. It’s delivered by systems that are transparent, observable, and collaboratively maintained. And it’s upheld by cultures that reward prevention, not just rescue.

As we’ll explore in the following sections, the hero engineer may win the battle, but their existence signals a war the organization is slowly losing. If the goal is real velocity, true resilience, and sustainable scale, then the age of heroics must give way to the age of engineering systems.

“When your system relies on heroes, you’re one missed Slack ping away from a crisis.”

Key Takeaways: The Myth of the Network Hero

Heroism is reactive and unsustainable; it signals systemic failure, not resilience
Organizations that rely on “saviors” are inherently fragile and scale-poor
Institutional memory locked in individuals creates organizational bottlenecks
Every heroic save masks missing automation, documentation, or design rigor
True operational excellence is built on engineered systems, not exceptional individuals
Prevention, automation, and repeatability must be more valued than firefighting

The Rise of the Network Hero: Origins and Allure

The hero engineer didn’t emerge from nowhere; they were born of necessity. In the early days of large-scale networks, particularly in telecom, early hyperscalers, and Internet exchanges, many systems were pieced together from a mix of vendor gear, custom scripts, and informal conventions.

Documentation was sparse. Interfaces were inconsistent. If you wanted to debug a problem, you needed to SSH into half a dozen routers (or possibly more), grep through syslog files, mentally reconstruct control plane events, and hope you could reverse-engineer what went wrong.

In those environments, deep product intuition and an understanding of network behavior became a valuable currency. A select few engineers who had lived through the “birth” of the network carried this intuition like war stories. They knew, for instance, that a particular edge router would blackhole traffic if a certain prefix was advertised with a malformed MED. Or that a certain DWDM link flapped during cold weather due to physical layer quirks. These were insights you couldn’t Google. They were learned in fire.

Naturally, those who accumulated this knowledge became indispensable. When something broke, you didn’t look at runbooks: you called them. And when they fixed it, often in minutes, it reinforced their mythos. Leadership took note. Their names came up in performance reviews, escalations, and even customer meetings. Promotions followed. And slowly but surely, the message was reinforced across the org: firefighting is how you get recognized.

This mythology even shaped the structure of many networking teams. Instead of investing in platform teams, automated configuration pipelines, or scalable observability tooling, orgs doubled down on headcount: more bodies to solve problems manually. War rooms became the standard operating model. Teams were built around “go-to” engineers rather than well-defined operational processes. If a change was risky, the question wasn’t “Is it safe?” It was “Is Alice online?”

And yet, in almost every one of these environments, velocity slowed to a crawl. Releases were delayed because a single person needed to review every change. Outages dragged on because only one person knew where the correct logs lived. Automation was proposed, but never implemented, because the knowledge to build it was locked away in the same hero’s head.

One particularly telling case comes from a Fortune 100 enterprise where the lead network architect was the sole owner of the configuration logic for a core route reflector. It had evolved over a decade through dozens of incremental changes, each layered without regression testing. When that architect retired, a single routing leak during a topology migration caused hours of service impact, and no one could explain why the config behaved the way it did. It took weeks to reconstruct its logic. The fallout wasn’t just technical; it revealed a deeper issue: the organization had mistaken survivability for resilience.

The allure of the hero persists because it’s easy to quantify their output: tickets closed, incidents resolved, hours worked. What’s harder to measure, but far more critical, is the cost they impose: in bottlenecked decision-making, inhibited team growth, and fragility masked as reliability.

In the next section, we’ll dive into how this hero culture morphs into something even more damaging: tribal knowledge, a silent killer of scale, sustainability, and true network ownership.

“Hero culture is a relic of fragile systems: it thrived not because it was ideal, but because there was no alternative.”

Key Takeaways: The Rise of the Network Hero

Hero culture emerged from necessity in undocumented, fragile, legacy systems
Early network environments rewarded intuition and tenure over process and scale
“Go-to” engineers became bottlenecks as knowledge centralized around individuals
Institutional incentives (recognition, promotion) reinforced firefighting over system-building
Teams built around people, not processes, slow down change and resist automation
The illusion of resilience is dangerous: survivability ≠ sustainability

Tribal Knowledge Is The Anti-Pattern That Kills Scale

If hero engineers are the visible champions of a broken culture, tribal knowledge is the invisible foundation upon which it rests. At first glance, tribal knowledge sounds benign, even practical. It refers to the informal know-how passed between teammates, usually without documentation, often via chat, hallway conversations, or shoulder taps. But in the context of modern networking at scale, tribal knowledge is a time bomb.

Unlike codified knowledge, such as runbooks, design docs, and automation playbooks, tribal knowledge has no audit trail. It lives in people, not systems. And when those people leave, take a vacation, or simply forget, the knowledge vanishes.

This dynamic creates a dangerous asymmetry. The engineer who knows how to manually reroute multicast traffic during a transit provider peering loss might get it right 9 times out of 10. But the 10th time? When they’re out sick or unavailable? That’s the outage.

A real-world case illustrates this well. In a Tier 1 service provider, one engineer knew the undocumented sequence required to reboot a legacy aggregation switch cluster without triggering a spanning-tree reconvergence storm. It involved a specific order of interface disablements and a brief wait interval due to quirks in vendor firmware. This wasn't written down anywhere. During a routine hardware replacement, a junior engineer followed the documented process, which didn’t include this workaround, triggering a broadcast storm that took down customer access across an entire metro region for over 90 minutes.

The postmortem didn’t find a technical bug. It found a knowledge failure. Worse, it found a team culture that rewarded tribal familiarity more than systems thinking.

Tribal knowledge stifles scale in three primary ways:

  1. Operational Bottlenecks
    When knowledge resides with only a few, they become the gatekeepers of every major change. Approvals slow down. Reviews stall. Changes pile up. Eventually, momentum dies, not merely because of process delays, but because of the fear of acting without those tribal gatekeepers.

  2. Tooling Fragmentation
    In hero-driven teams, tools evolve around individual preferences. One engineer might maintain a Python script for BGP peer health. Another uses expect scripts to automate CLI commands. None are centralized. None are tested. This fragmented ecosystem grows until even small changes risk breaking something critical.

  3. Onboarding Paralysis
    New hires quickly learn that the documented path is rarely the right one. They’re told to “just ask Jack” or “ping Priya” instead of consulting systems. This not only undermines self-sufficiency but also discourages contributions to documentation and tooling, as the reward structure is skewed. Fixing things in a fire is visible, whereas preventing fires is not.

At scale, tribal knowledge becomes a liability far greater than any technical debt. Unlike configuration drift, it doesn’t show up in a diff. It hides in plain sight, insidiously shaping how teams operate and fail.

Organizations that depend on tribal knowledge ultimately end up in a brittle state where a single resignation, role change, or PTO request can become an operational risk. When knowledge is siloed, availability becomes personal rather than systemic.

The path forward requires deliberate unlearning. It involves building systems, cultures, and tooling that encode knowledge, rather than just storing it in people's heads. In the next section, we’ll explore how this cultural debt impacts what is often considered the holy grail of modern infrastructure: velocity.

“If the answers live in heads instead of systems, your infrastructure isn’t operational: it’s oral tradition.”

Key Takeaways: Tribal Knowledge Is The Anti-Pattern That Kills Scale

Tribal knowledge is undocumented, unverifiable, and vanishes with people
It creates single points of failure and bottlenecks for change
Teams dependent on oral transmission discourage onboarding and autonomy
Fragmented tools and undocumented processes lead to operational chaos
Tribal knowledge can’t be audited, scaled, or automated—it’s anti-infrastructure
Resilient teams codify what they know and distribute that knowledge via systems

Hidden Opportunity Cost: When Velocity Dies

Every modern tech organization aspires to move fast, to iterate quickly, deploy confidently, and adapt to change with minimal friction. But in environments where hero culture and tribal knowledge dominate, velocity is the first and most invisible casualty.

At first, the signs are subtle: a new peer configuration takes days instead of hours because it needs “senior sign-off.” A routine routing policy update is delayed because the only person who understands the downstream impact is out. Over time, these delays accumulate, and what was once an agile, responsive engineering team becomes an operational bureaucracy where nothing ships without a war room.

Why? Because in hero-driven cultures, systems aren’t built to scale; they’re built to survive. And survival is reactive by nature.

One illuminating example comes from a global financial services provider migrating from MPLS to SD-WAN. The initiative promised agility: zero-touch provisioning, dynamic failover, and centralized orchestration. But under the hood, every configuration change had to be reviewed by two senior engineers who had “seen it all” during previous WAN transitions.

The change approval process became a bottleneck. The migration slipped by quarters. And worst of all, automation was sidelined because it was deemed too risky unless “the usual experts” vetted every output.

What was intended as a step toward speed became shackled by institutional inertia.

Velocity isn't just about how fast you can deploy a change; it's about how easily your system enables safe and repeatable changes at scale. And here’s the key truth: a system that relies on heroics is not a high-velocity system; it is a high-friction one.

In elite-performing engineering organizations, those that deploy hundreds or thousands of changes per day across massive networks, the common denominator is always the same: confidence built through systems. These teams:

  • Codify configuration as intent and validate it with CI pipelines.

  • Store historical decisions and rationale in version-controlled design repositories.

  • Build guardrails, not gates, so junior engineers can deploy safely without waiting for tribal sign-off.
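
To make the “guardrails, not gates” idea concrete, here is a minimal sketch of a pre-merge check a CI pipeline could run against a hypothetical YAML intent file describing BGP peers. The schema, field names, and limits are invented for illustration; the point is that the pipeline, not a person, enforces the boundaries.

```python
# guardrail_check.py: hypothetical pre-merge guardrail for a YAML peer-intent file
import sys

import yaml  # PyYAML

# Illustrative schema: every peer must declare these fields before it can merge
REQUIRED_FIELDS = {"name", "peer_asn", "neighbor_ip", "max_prefixes"}

def validate_peers(intent: dict) -> list[str]:
    """Return human-readable violations; an empty list means the change may merge."""
    errors = []
    for peer in intent.get("bgp_peers", []):
        missing = REQUIRED_FIELDS - peer.keys()
        if missing:
            errors.append(f"{peer.get('name', '<unnamed>')}: missing {sorted(missing)}")
            continue
        # Guardrail, not gate: a sane prefix limit is enforced automatically
        if not 0 < peer["max_prefixes"] <= 500_000:
            errors.append(f"{peer['name']}: max_prefixes {peer['max_prefixes']} out of range")
    return errors

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        violations = validate_peers(yaml.safe_load(f) or {})
    for v in violations:
        print(f"GUARDRAIL VIOLATION: {v}")
    sys.exit(1 if violations else 0)  # a non-zero exit fails the CI job
```

With a check like this wired into the pipeline, a junior engineer can add a peer without waiting for tribal sign-off, because an unsafe change simply cannot merge.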

Contrast that with organizations that depend on tribal knowledge. In those environments, every change is a risk. The cost of failure is unknown, so the cost of caution becomes infinite. Engineers are conditioned to defer, to wait, to fear.

The opportunity cost isn’t just slower deployment. It’s the innovation you don’t ship. The time not spent improving tooling because engineers are stuck in back-to-back troubleshooting calls. The automation that never gets written because everyone’s focused on avoiding the next crisis, not engineering it out of existence.

And critically, this isn’t a failure of individuals; it’s a failure of systems design. Heroism thrives where systems are opaque, brittle, and undocumented. Velocity thrives where systems are observable, composable, and auditable.

As we move into the next section, we’ll examine how these operational anti-patterns bleed into the human domain, undermining not just efficiency, but well-being, retention, and trust within engineering teams.

“Speed dies in silence; every undocumented dependency is a hidden delay waiting to happen.”

Key Takeaways: Hidden Opportunity Cost, When Velocity Dies

Hero culture introduces friction, slows down change, and erodes velocity
Manual sign-offs, fear of breaking things, and tribal bottlenecks delay delivery
Reliance on hero reviews prevents junior empowerment and CI/CD adoption
Innovation stalls as engineers firefight instead of building forward
High-velocity teams codify intent, automate validation, and trust their systems
Confidence through automation replaces caution driven by fragility

The Personal Toll: Burnout, Attrition, and Broken Culture

The myth of the heroic engineer doesn’t just erode technical systems: it quietly shatters human ones.

Behind every late-night login, every frantic bridge call, every text at dinner that reads “Need you to jump on now”, is an engineer carrying the weight of institutional fragility on their back. For a while, it feels like valor. Then it feels like fatigue. Eventually, it becomes burnout.

Hero culture, by design, is unsustainable. It relies on a few individuals being always available, deeply familiar, and constantly reactive. These aren’t just occasional interruptions; they’re lifestyle dependencies. Over time, this breeds chronic stress, fragmented attention, and the inability to ever fully unplug.

In one case at a major carrier, a senior engineer was informally dubbed the “BGP whisperer.” Anytime a peering instability occurred, he was called, regardless of time zone or rotation. He felt indispensable and initially took pride in it. But after three years of constant high-stakes interventions and missed family events, he burned out. He resigned with no immediate job lined up, driven solely by a desire to stop being the patch for systemic gaps.

The damage doesn’t end with burnout. It seeps into culture in more subtle, corrosive ways:

  • Psychological Safety Deteriorates: Junior engineers fear making mistakes, not because of technical risk, but because they know there’s no fallback unless a hero intervenes. This kills experimentation and innovation.

  • Knowledge Hoarding Becomes Rewarded: When recognition flows to those who “save the day,” others are incentivized to hoard knowledge rather than document it. This further centralizes expertise and undermines scale.

  • Mentorship Breaks Down: Instead of investing in enabling others, heroes are constantly in reactive mode. They don’t have time to mentor, to train, to grow the team. And when they do, it’s often through ad hoc anecdotes rather than structured enablement.

  • Attrition Becomes Contagious: When the system leans too hard on a few, and those few leave, others often follow. The gap left behind is too broad to bridge easily, and the organization enters a spiral of reactive hiring and mounting technical debt.

There’s also a more profound psychological shift: resentment. What once felt like prestige, like being needed, being relied upon, can turn into bitterness. Engineers begin to see that their heroics are not leading to systemic improvement. They’re not building better networks. They’re just preventing collapse, again and again.

And here lies the tragedy: organizations often lose their best people not because they couldn’t handle the work, but because they were asked to carry what should have been shared by systems and teams.

In modern network engineering, the goal should never be to find more heroes. It should be to make heroism unnecessary. To build cultures where people are valued for preventing fires, not for extinguishing them. To foster environments where balance, not burnout, is the norm.

Next, we’ll shift from the human to the organizational level and explore how hero culture not only harms individuals but also puts the business itself at risk, from compliance failures to systemic outages that no one can explain.

“Burnout isn’t caused by hard work. It’s caused by the absence of support, clarity, and boundaries.”

Key Takeaways: Burnout, Attrition, and Broken Culture

Hero culture burns out top engineers with endless, unsustainable escalations
Recognition skews toward firefighting over long-term system improvements
Psychological safety erodes; fear of failure and gatekeeping prevail
Mentorship, onboarding, and team growth suffer due to constant urgency
Healthy cultures reward prevention, documentation, and resilience engineering
Sustainable organizations distribute responsibility, foster safety, and celebrate balance

Business Risk: From Undocumented Systems to Compliance Failures

A hero-driven network operation isn't just a technical liability or a cultural problem; it’s a direct business risk. And unlike the visible, dramatic saves of a late-night engineer fixing a core routing loop, the risks manifest subtly, then all at once: in failed audits, SLA breaches, unrecoverable outages, and an inability to meet compliance obligations.

At the heart of this risk is a simple truth: undocumented systems are ungovernable systems. If your most critical routing policies, failover paths, or mitigation scripts live in someone’s home directory, or worse, their head, then your business continuity is built on sand.

Consider the case of a regional ISP whose backbone routing architecture had evolved over a decade with no centralized documentation. Changes were tracked through email threads and tribal memory. When a BGP policy migration triggered an unexpected path change, a Tier 1 customer experienced degraded latency for hours. The NOC couldn’t triage because the live state didn’t match any reference architecture. Worse, the network diagrams were two years old. By the time the right engineer was tracked down and the fix deployed, the SLA violation had already cost the company a financial penalty and the loss of at least one large contract renewal that I am aware of.

This isn’t an anomaly. It’s a pattern. When heroics replace institutional process, you lose:

  • Auditability
    Regulatory compliance in sectors like finance, healthcare, and defense mandates traceability: who made which change, when, and why. Hero-driven teams often bypass change management systems because “we needed to fix it fast.” In doing so, they create compliance gaps that can trigger fines or audit failures.

  • Change Hygiene
    In high-performing environments, changes are tested, reviewed, and deployed via infrastructure-as-code pipelines. Hero culture, on the other hand, tends to favor live CLI tweaks; quick, undocumented, and hard to reverse. Even if the change “worked,” it may have introduced invisible misconfigurations or security exposures.

  • Availability Guarantees
    SLAs are only meaningful if backed by systems that can operate predictably, even in the event of failure. Relying on tribal knowledge for failover logic means your high availability posture is only as reliable as your on-call responder’s memory and their Wi-Fi connection at 2 a.m.

  • Business Continuity and DR
    When key operational knowledge isn’t institutionalized, disaster recovery (DR) exercises fail. One Fortune 500 company discovered during a DR drill that none of the junior staff knew how to bootstrap the network from a cold start because the procedure was never written down. The only person who knew had recently transferred to another team.

These are not theoretical risks. In regulated industries, they are existential. In competitive ones, they are strategic vulnerabilities.

There’s also the reputational damage. Customers don’t care whether your outage was due to a hardware bug, a config mistake, or an internal knowledge gap. They care that their service was down, and that it took you longer than it should have to restore it. Repeated incidents tied to hero dependency quickly erode trust.

The irony is stark: the same engineers who keep the lights on through heroic effort often become single points of business failure. And by the time leadership realizes it, it’s often too late: the talent has burned out, entropy has set in, and the business is paying the price in both capital and credibility.

Next, we’ll pivot from risk to remedy. How can we build networks and organizations that are resilient by design rather than relying on reactive measures? In the following section, we’ll look at what it means to engineer for failure and automate the toil out of operations.

“The true cost of tribal knowledge is paid not in downtime but in broken trust, lost customers, and failed audits.”

Key Takeaways: Business Risk, From Undocumented Systems to Compliance Failures

Undocumented systems are ungovernable and pose audit, SLA, and DR risks
Hero-dependent fixes often bypass change management and compliance tracking
Live CLI tweaks create untraceable changes and systemic fragility
Organizations suffer real financial and reputational losses due to undocumented behavior
Enterprise-grade reliability requires auditability, traceability, and process discipline
Proactive documentation, version control, and automation are business enablers, not overhead

Building Resilience: Engineering for Failure, Not Reaction

If hero culture thrives on chaos, resilience thrives on preparation. It’s not the absence of failure that defines a high-performing network; it’s the presence of systems that handle failure gracefully, predictably, and without panic.

Modern resilience engineering begins with a mindset shift: assume failure is inevitable. Links will go down. BGP sessions will drop. Code will contain bugs. Instead of reacting to these failures, elite teams engineer around them. The question becomes: how can the system degrade safely, recover autonomously, and notify engineers after the fact, not during a 2 a.m. firefight?

This shift requires more than tooling. It demands systemic thinking.

Take, for example, the practice of chaos engineering, once controversial but now foundational at organizations like Netflix, AWS, and Facebook. These companies deliberately inject faults into production environments: tearing down links, killing processes, and injecting latency. Not to create outages, but to discover blind spots. To ensure their systems can withstand and recover from disruptions without human intervention. And to build muscle memory into their tooling, not their people.
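
As a toy illustration of the principle (nothing close to a production chaos framework), the sketch below models a small topology, “fails” each link in turn, and checks that every pair of nodes is still reachable. The topology and helper names are invented for the example.

```python
# chaos_drill.py: toy fault-injection sketch, fail each link and verify reachability survives
from collections import deque
from itertools import combinations

# Illustrative topology: a ring plus one chord, so every link should have a backup path
LINKS = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")}

def reachable(links: set[tuple[str, str]], src: str, dst: str) -> bool:
    """Breadth-first search over an undirected set of links."""
    adj: dict[str, set[str]] = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def drill() -> list[str]:
    """Fail every link once and report any node pair that loses connectivity."""
    nodes = sorted({n for link in LINKS for n in link})
    findings = []
    for failed in LINKS:
        surviving = LINKS - {failed}
        for src, dst in combinations(nodes, 2):
            if not reachable(surviving, src, dst):
                findings.append(f"losing {failed} isolates {src}<->{dst}")
    return findings

if __name__ == "__main__":
    print(drill() or "all single-link failures are survivable")
```

The value is not the script itself; it is that failure modes get rehearsed continuously by a machine instead of being discovered at 2 a.m. by a person.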

Contrast this with hero-driven teams who treat incidents like puzzles to be solved under pressure. They may eventually fix the issue, but they walk away with knowledge that remains undocumented, unshared, and unautomated. The system remains vulnerable. And the next failure is just waiting for its cue.

Here are the key strategies high-resilience network organizations adopt:

  • Runbook Codification
    Every recurring manual action, such as interface flap recovery, route re-convergence checks, and peering validation, is turned into an automated workflow or executable playbook. The goal is not just to document the steps, but to embed them into systems.

  • Blameless Postmortems
    Incidents become learning opportunities, not witch hunts. The focus shifts from “Who caused this?” to “How did our systems allow this to happen?” This cultural safety encourages transparency and continuous improvement.

  • Observability by Default
    Instead of relying on intuition, systems emit structured, model-driven, pipeline-ready telemetry. Engineers can query real-time state across the fleet using gNMI, OpenConfig models, or streaming telemetry. Failures become data-rich, not detective work.

  • Proactive Fault Injection
    In controlled environments, organizations regularly simulate device failures, link loss, and path degradation, not to break things, but to validate that auto-remediation, reroute logic, and alerting behave as expected. The failure is rehearsed, not feared.

  • Infrastructure as Code and Intent-Based Networking
    Configs are no longer snowflakes crafted in the CLI. They are declarative, versioned, tested, and deployed through pipelines. Engineers reason about what the network should do, not how to configure it on every device manually.
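
That last strategy reduces to a simple control loop: declare what should be true, observe what is true, and reconcile the difference. Here is a stripped-down sketch of that loop, with invented data structures standing in for a real source of truth and a real telemetry feed; it is illustrative, not a production reconciler.

```python
# reconcile.py: stripped-down intent reconciliation loop (invented data, no vendor API)
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    device: str
    interface: str
    desired_state: str  # "up" or "shutdown"

# In practice, desired state comes from a version-controlled source of truth
INTENTS = [
    Intent("edge1", "Ethernet1", "up"),
    Intent("edge1", "Ethernet2", "shutdown"),
]

# In practice, observed state comes from streaming telemetry
OBSERVED = {("edge1", "Ethernet1"): "down", ("edge1", "Ethernet2"): "shutdown"}

def plan_remediation(intents: list[Intent], observed: dict) -> list[str]:
    """Compare desired vs. observed state and list the actions a remediation engine would take."""
    actions = []
    for intent in intents:
        actual = observed.get((intent.device, intent.interface), "unknown")
        if actual != intent.desired_state:
            actions.append(
                f"{intent.device}/{intent.interface}: observed '{actual}', "
                f"reconciling to '{intent.desired_state}'"
            )
    return actions

if __name__ == "__main__":
    for action in plan_remediation(INTENTS, OBSERVED):
        print(action)
```

Nobody SSHes anywhere: the engineer's job is to change the intent, and the system's job is to make reality match it.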

Consider Google’s production network. Their Site Reliability Engineering (SRE) teams operate under strict error budgets and assume components will fail. Automation remediates BGP route churn, re-converges traffic, and alerts only when thresholds are violated. No one needs to SSH into a router because the system has already adapted.

The result? Consistent operations, fewer escalations, improved sleep, quicker change velocity, and a culture where engineers are empowered by tools instead of being limited by tribal expectations.

True resilience doesn’t mean nothing ever breaks. It means you no longer depend on a specific human to fix it. You’ve replaced heroes with hardened systems and reactive fire drills with proactive engineering.

In our next section, we’ll examine what these resilient organizations look like, technically, culturally, and structurally. What does good look like? What does elite look like? And how do you get there?

“Don’t build systems to avoid failure; build them to recover without needing you.”

Key Takeaways: Building Resilience, Engineering for Failure, Not Reaction

Resilient systems assume failure is inevitable and build in safe degradation
Chaos engineering and failure injection expose fragility before it becomes outage
Runbooks must be automated, not just written
Observability should be model-driven, real-time, and actionable
Blameless postmortems drive systemic fixes, not blame
Automation replaces intuition; systems replace saviors

What Good Looks Like: Cultural and Technical Remedies

In stark contrast to hero-dependent teams, high-performing network organizations don’t just weather failures; they anticipate, absorb, and learn from them. They scale not through heroic effort, but through cultural rigor and engineering discipline. Let’s explore what “good” looks like in both dimensions.

Culturally: The End of the Hero Era

The most resilient organizations treat culture as infrastructure.

  • Blamelessness as Policy, Not Platitude
    Post-incident reviews never name individuals: they examine systems. When a misconfigured policy takes down peering, the retrospective investigation focuses on how the deployment pipeline allowed it, rather than who mistyped a configuration.

  • Documentation is a First-Class Artifact
    Every tribal insight, whether about a quirky route redistribution or a vendor firmware bug, is captured and version-controlled. There are no “just ask Joe” dependencies. If a procedure isn’t documented, it’s assumed to be broken.

  • Shared Ownership Over Silos
    Teams rotate responsibilities, review each other’s tooling, and cross-train regularly. On-call isn’t a burden; it’s a shared investment in uptime. Everyone participates in incident simulations and readiness drills.

  • Preventative Work is Celebrated
    Engineers who automate away manual toil, create repeatable runbooks, or improve observability get the same, or more, recognition as those who heroically resolve incidents. Prevention becomes prestige.

Technically: Engineering for Autonomy and Auditability

Elite organizations embrace practices and tooling that codify intent, verify outcomes, and eliminate ambiguity.

  • Infrastructure as Code (IaC)
    Whether configuring routing policies, ACLs, or QoS profiles, everything lives in version control. Changes go through CI pipelines that validate syntax, check compliance, and test impact. Engineers no longer “make changes”; they commit code.

  • Intent-Based Networking (IBN)
    Instead of managing configurations, teams declare the desired state of the network. Systems automatically reconcile the actual state with the intended state. For example, “this subnet must be reachable via two diverse paths” becomes a policy, not a series of commands.

  • Automated Safety Nets
    Canary deployments validate changes on a subset of devices. Automated rollbacks restore previous states if telemetry deviates from expected behavior. Engineers don’t need to watch graphs; they trust the system to enforce safeguards.

  • Observability and Telemetry at Scale
    Networks stream real-time state using gNMI or OpenConfig models. BGP sessions, interface stats, and route churn are all queryable. Engineers can write alerts as code and respond to data, rather than relying on hunches (a small alert-as-code sketch follows this list).

  • Service Ownership at the Edge
    Application teams have self-service access to provisioning APIs, pre-approved change workflows, and visibility into relevant network metrics. This reduces friction and empowers faster iteration while keeping guardrails in place.
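
Here is the small alert-as-code sketch referenced above. The telemetry sample format, the paths, and the thresholds are invented for the example; a real deployment would consume gNMI or OpenConfig streams and feed a proper alerting pipeline.

```python
# alert_rules.py: sketch of alerts defined as code over invented telemetry samples
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Sample:
    device: str
    path: str    # an OpenConfig-style path string, illustrative only
    value: float

@dataclass
class Rule:
    name: str
    path: str
    predicate: Callable[[float], bool]  # True means "raise an alert"

# Rules live in version control and are reviewed like any other code change
RULES = [
    Rule("bgp-session-flaps", "/bgp/neighbors/neighbor/state/flaps", lambda v: v > 5),
    Rule("interface-in-errors", "/interfaces/interface/state/counters/in-errors", lambda v: v > 100),
]

def evaluate(rules: Iterable[Rule], samples: Iterable[Sample]) -> list[str]:
    """Return an alert message for every sample that violates a rule on the same path."""
    alerts = []
    for sample in samples:
        for rule in rules:
            if sample.path == rule.path and rule.predicate(sample.value):
                alerts.append(f"{rule.name}: {sample.device} value={sample.value}")
    return alerts

if __name__ == "__main__":
    demo = [Sample("edge1", "/bgp/neighbors/neighbor/state/flaps", 9.0)]
    print(evaluate(RULES, demo))
```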

A concrete example: At Meta, the entire production backbone is managed using a source-of-truth system layered with rigorous CI/CD. Every router is continuously validated against intent. Engineers submit changes like software pull requests, reviewed by peers, and verified by machines. Incidents are rare, fast to remediate, and well-documented. It’s not just a network; it’s a platform.

This is what good looks like. And more importantly, this is what scalable looks like.

“The highest-performing teams aren’t built on brilliance. They’re built on repeatability.”

Key Takeaways: What Good Looks Like, Cultural and Technical Remedies

Elite teams treat culture as infrastructure: blameless, documented, and shared
Preventative work is recognized as much as, or more than, reactive response
IaC and IBN remove ambiguity and enforce consistency at scale
Automated safety nets (canary deploys, rollback, validation) build confidence
Knowledge lives in systems, not people; onboarding is accelerated and scalable
Team ownership, version control, and intent-centric operations replace heroism

How Do You Get There?

Transformation starts with leadership. It requires:

  • A clear mandate to eliminate undocumented processes.

  • Investment in automation, observability, and documentation as product features.

  • A willingness to slow down to build systems that let you speed up sustainably.

  • A culture that values learning, sharing, and prevention over firefighting.

In the final section, we’ll bring this all together. We’ll dismantle the myth of heroism once and for all, and propose a path forward that values people, systems, and scale in equal measure.

Self-Assessment: Is Your Org Stuck in Hero Culture?

This self-assessment is designed to help you identify whether your network engineering organization operates under a culture that over-relies on heroics, tribal knowledge, and reactive workflows. Check each statement that applies to your current environment:

Operational Dependency

  • Critical issues consistently require involvement from the same 1–3 individuals

  • On-call engineers are regularly escalated outside of rotation boundaries

  • Key scripts or tools for diagnostics/remediation are not in shared version control

  • Only certain engineers know how to perform high-risk maintenance tasks safely

Knowledge Silos

  • Recovery or escalation procedures are not thoroughly documented

  • New hires rely on asking veteran engineers instead of reading shared docs

  • There is no authoritative source of truth for design, routing policy, or topology

  • Engineers learn workarounds through oral transmission, not structured enablement

Change Management Bottlenecks

  • Changes must be manually approved by a specific senior engineer

  • CI/CD pipelines for configuration validation are absent or unreliable

  • Engineers avoid automation due to the fear of breaking undocumented logic

  • High-risk changes are delayed due to the absence of “go-to” reviewers

Cultural and Psychological Markers

  • Postmortems focus on “who made the mistake” over “how the system allowed it”

  • Engineers are praised more for saving incidents than for preventing them

  • Operational toil is normalized as a badge of honor

  • Contributions to documentation, observability, or automation are under-celebrated

Scoring

Tally the number of checkmarks to assess your organization's risk level:

  • 0–4: 🟢 Healthy Culture
    You emphasize systems, transparency, and scalable operations. Keep investing.

  • 5–8: 🟠 At Risk
    Heroics are filling in for broken or incomplete systems. You’re one failure away from exposure.

  • 9–16: 🔴 Critical Fragility
    You’re deeply reliant on human patches instead of engineered resilience. Systemic change is urgent.
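
If you'd rather keep score in code than on paper, the tally maps onto a trivial helper like the one below; the thresholds mirror the buckets above.

```python
# hero_score.py: trivial helper mirroring the self-assessment thresholds above
def risk_level(checked: int) -> str:
    """Map the number of checked statements (0-16) to the risk bucket used above."""
    if not 0 <= checked <= 16:
        raise ValueError("the assessment has 16 statements")
    if checked <= 4:
        return "Healthy Culture: keep investing in systems and transparency"
    if checked <= 8:
        return "At Risk: heroics are filling in for incomplete systems"
    return "Critical Fragility: systemic change is urgent"

if __name__ == "__main__":
    print(risk_level(7))  # seven boxes checked -> "At Risk: ..."
```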

What to Do Next

If your organization falls in the “At Risk” or “Critical” categories:

  • Audit and document your critical escalation and remediation flows

  • Implement CI/CD pipelines for network change validation

  • Assign owners to codify tribal knowledge into runbooks and tooling

  • Introduce blameless postmortems and rotate incident commander roles

  • Start recognizing and rewarding preventive, not just reactive, contributions

Reminder: This isn't a judgment: it's a mirror. Each checkmark is an opportunity to scale your culture, systems, and team sustainably.

Appendix: Bad vs. Good Practice Snapshots

Bad Practice → Good Practice

1. The Myth of the Network Hero
  • Recovery depends on manual heroics and CLI hacks → Resilience via automation, runbooks, and shared tooling
  • Knowledge hoarded in individual memory → Codified knowledge in systems and documentation
  • Recognition for incident response → Recognition for prevention and the engineering discipline

2. The Rise of the Network Hero
  • Teams built around "go-to" people → Teams built around scalable processes
  • Change safety requires individual sign-off → Change safety validated by CI/CD and peer reviews
  • Tooling is personal and fragmented → Tooling is shared, integrated, and tested

3. Tribal Knowledge
  • Recovery logic undocumented → Logic lives in the source of truth and playbooks
  • Onboarding by oral tradition → Onboarding through documentation and self-service platforms
  • Tools hidden in home directories → Tools versioned, tested, and shared

4. Hidden Opportunity Cost
  • Changes are slow, gated, risky → Changes are rapid, tested, and safe
  • Engineers firefight instead of innovating → Engineers build, automate, and improve systems
  • Senior engineers are bottlenecks → Senior engineers mentor and architect

5. Burnout and Culture
  • Engineers are always on call, with little support → Distributed rotations, strong peer coverage
  • Prestige via heroism → Prestige via reliability, enablement, and mentoring
  • High attrition, low morale → Sustainable culture with career growth

6. Business Risk
  • No traceability or compliance guardrails → Full version history, change tracking, and validation pipelines
  • DR knowledge not rehearsed or documented → Recovery plans codified, tested, and distributed
  • Fragile customer trust due to delayed recovery → Fast, auditable, repeatable recovery flows

7. Engineering for Resilience
  • Unstructured recovery post-failure → Chaos testing, proactive failure planning
  • Incident response manual → Self-healing and policy-driven systems
  • Observability ad hoc → Model-driven, telemetry-fed observability

8. What Good Looks Like
  • Siloed scripts, ad hoc practices → Centralized pipelines, shared ownership
  • Operational load centralized → Operational load distributed by design
  • Outdated or nonexistent documentation → Living documentation owned by teams

9. Replace the Hero with a System
  • System fails if key people are unavailable → System designed to succeed without intervention
  • Fixes rely on intuition and experience → Fixes rely on codified logic and alerting
  • Success is measured by firefighting → Success is measured by uptime, speed, and simplicity

The Bottom Line Is That…

“Great systems don’t need heroes. They quietly prevent the need for one.”

Key Takeaways: Replace the Hero with a System

Heroism is a symptom of system failure, not a sign of excellence
Individual brilliance cannot scale; reliance on it increases organizational fragility
Resilient engineering replaces reaction with automation, documentation, and design
High-performing teams make systems boring, predictable, and reliable
The goal isn’t more heroes; it’s fewer emergencies and fewer dependencies
Build systems that work even when no one is awake to save them

Hero-driven engineering may solve immediate problems, but it introduces severe long-term fragility. When operational success depends on individual memory and undocumented expertise, organizations suffer from bottlenecks, compliance risks, and unsustainable cultural norms.

This article explored how the culture of heroism in network engineering hinders true resilience, operational velocity, and human sustainability. It contrasted legacy heroics with modern practices, such as Infrastructure as Code, intent-based networking, and blameless postmortems.

The goal was not to diminish the contributions of experienced engineers, but to shift their efforts toward building scalable systems that make heroism obsolete. In doing so, teams can evolve from reactive responders to resilient, forward-looking engineering organizations.

See you in the next high-signal edition!

Leonardo Furtado
