7. TCO and Risk Modeling: Beyond “Cheapest Box Wins”

At some point in every design review, someone will ask the question that sounds responsible but is often dangerously shallow:

“Which one is cheaper?”

If by “cheaper” they mean “lower line item on the vendor quote,” that’s the wrong question.

A network device is not a one-time purchase; it’s a multi-year cost and risk stream. It consumes power and cooling, it occupies space, it requires humans to babysit it, and when it fails or behaves badly, it sets money on fire in very creative ways.

This is why mature teams talk about Total Cost of Ownership (TCO) and risk, not just unit price.

And again, the smartphone metaphor is the same lesson in miniature:

  • The “cheapest phone” might save you money at the cash register.

  • But if the battery dies at noon, the screen cracks on the first drop, and apps constantly crash, the real cost is much higher: lost time, frustration, and replacements.

Let’s translate that into network language.

TCO 101: It’s Not Just the Box Price

Total Cost of Ownership (TCO) is everything you pay, directly and indirectly, over the lifetime of the technology. Broadly:

  • CAPEX – what you pay to acquire it.

  • OPEX – what you pay to power, house, support, and operate it.

  • Operational cost – the hidden labor and incident cost of living with it.

Let’s break those down with network-specific knobs.

CAPEX: Chassis, Linecards, Optics, Licenses

This is the part everyone sees:

  • Chassis and fixed platforms – list price vs discount, number of slots, base system.

  • Linecards and modules – port densities, 100G/400G/800G, special feature cards (e.g., encrypted ports, deep buffers).

  • Optics and cables – QSFPs, DACs, AOCs; often a large percentage of total CAPEX.

  • Software and feature licenses – base OS, advanced features (MPLS, EVPN, SR, telemetry, security bundles), bandwidth tiers.

This is where Vendor A might look “cheaper” on paper:

  • Lower chassis and linecard pricing.

  • Fewer or cheaper licenses.

  • Aggressive discounts for landing the deal.

If you stop here, you’re basically buying gas-station phones because they’re on sale.

OPEX: Power, Cooling, Space, Support, Training

Then comes OPEX, the stuff that recurs every month or year:

  • Power:

    • How many watts per chassis, per linecard, per port?

    • Over 5 years, a high-power box can cost as much in electricity as its original purchase price (see the quick sketch below).

  • Cooling:

    • More power = more heat = more cooling cost.

    • DCs have finite cooling capacity; an overheated box can force infrastructure upgrades.

  • Space:

    • RU per box, number of racks required.

    • More racks = more lease cost, more structured cabling, more everything.

  • Support contracts:

    • Vendor TAC contracts, advanced hardware replacement SLAs.

    • These scale with device count and feature sets.

  • Hardware sparing:

    • Spares you must stock for linecards, power supplies, and fabric modules.

    • Tied to your failure expectations and vendor RMA performance.

  • Training and enablement:

    • Courses, labs, certification paths.

    • The time your engineers spend ramping on a new NOS or architecture.

Some vendors win on CAPEX but lose badly on OPEX. Others cost more up front but sip power, fit more per rack, and come with simpler support models.
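
That power line item deserves a quick sanity check. Here's a back-of-the-envelope sketch in Python, as referenced above; the wattage, PUE, and electricity price are assumptions you'd swap for your own numbers:

```python
# Back-of-the-envelope 5-year electricity cost for one chassis.
# All inputs are illustrative assumptions, not vendor data.

draw_kw = 3.0          # assumed average draw per chassis (kW)
pue = 1.5              # assumed Power Usage Effectiveness (cooling overhead)
price_per_kwh = 0.15   # assumed blended electricity price (USD/kWh)
years = 5

hours = years * 365 * 24
energy_kwh = draw_kw * pue * hours
cost = energy_kwh * price_per_kwh

print(f"{energy_kwh:,.0f} kWh over {years} years -> ${cost:,.0f}")
# ~197,100 kWh -> ~$29,565 for a single box, per these assumptions.
```

Double the draw or the PUE and the electricity bill alone starts to rival the chassis price, before you count a single engineer-hour.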

Operational Cost: Toil, On-Call, and Incidents

This is the part that’s rarely quantified explicitly but hurts the most:

  • Hours per change:

    • How long does it take to safely roll out a config change?

    • Is it automated, or are engineers hand-editing configs?

    • Do you need “war rooms” for routine maintenance?

  • On-call load:

    • How often are people paged?

    • How long do they spend triaging issues related to this platform?

  • Incident frequency and severity:

    • Number of P1/P2 incidents attributable to this tech per year.

    • Blast radius when it fails.

    • Mean time to resolve.

This is where box A and box B can look similar on a quote, but:

  • Box A generates a steady drip of weird bugs, manual work, and confusion.

  • Box B mostly behaves and integrates well with your automation, and it doesn’t keep people awake at night.

The difference becomes:

  • Burnout vs sustainability.

  • Firefighting vs engineering.

  • “We need more headcount just to keep this thing alive” vs “We can run this at scale with a small, sharp team.”

That’s OPEX in human form.

A Tale of Two Vendors: Cheap vs “Expensive”

Let’s tell a simplified story.

You’re choosing a new edge router or DC firewall platform. Two contenders:

  • Vendor A – Cheaper Upfront:

    • Lower chassis and linecard costs.

    • Aggressive discounts.

    • Fewer licensing line items.

But:

  • Automation support is weak or proprietary.

  • APIs are clunky or incomplete.

  • Streaming telemetry is limited; logs are messy and inconsistent.

  • TAC is slow, and documentation is shallow.

  • The platform has a history of “interesting” bugs at scale.

  • Vendor B – Pricier Upfront:

    • Higher chassis and license costs.

    • Less discount leverage.

But:

  • Strong, well-documented APIs; model-driven config.

  • Rich telemetry (gNMI, YANG, well-structured logs).

  • Proven playbooks and integrations with your automation stack.

  • Mature HA features and well-understood failure behavior.

  • Better TAC, better docs, more community knowledge.

If you only look at CAPEX, Vendor A wins. The quote is lower. Procurement is happy.

But if Vendor A:

  • Consumes 20–30% more power per Tbps,

  • Requires more boxes to achieve the same capacity or redundancy, and

  • Causes 2–3 high-severity outages per year due to immature features,

Then, over 5–7 years, Vendor A might easily cost more in TCO:

  • Higher power and cooling costs (OPEX).

  • More racks and cabling (CAPEX + OPEX).

  • More engineer-hours spent babysitting it (OPEX).

  • More outage-related costs: SLA credits, churn, reputational damage.

Vendor B’s “expensive” platform may:

  • Reduce incidents by half or more.

  • Shorten MTTR thanks to better observability.

  • Enable higher change velocity via automation (fewer manual hours).

  • Fit more capacity per rack and per watt.

If you model:

TCO = CAPEX + (Power + Cooling + Space + Support + Labor + Incident Cost) over N years,

Vendor B might offer a lower TCO curve and better ROI, despite an uglier up-front quote.
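
To make that model concrete, here's a minimal sketch with hypothetical figures standing in for real quotes, power bills, and incident data:

```python
# Minimal TCO sketch: CAPEX plus recurring costs over N years.
# Every figure below is an assumption for illustration only.

def tco(capex, power, cooling, space, support, labor, incidents, years):
    """TCO = CAPEX + (Power + Cooling + Space + Support + Labor + Incident cost) over N years."""
    annual_opex = power + cooling + space + support + labor + incidents
    return capex + annual_opex * years

# Hypothetical annual figures (USD) for each vendor.
vendor_a = tco(capex=800_000, power=120_000, cooling=60_000, space=50_000,
               support=90_000, labor=250_000, incidents=300_000, years=5)
vendor_b = tco(capex=1_200_000, power=90_000, cooling=45_000, space=35_000,
               support=110_000, labor=120_000, incidents=80_000, years=5)

print(f"Vendor A 5-year TCO: ${vendor_a:,}")  # $5,150,000
print(f"Vendor B 5-year TCO: ${vendor_b:,}")  # $3,600,000
```

With these (made-up) inputs, Vendor A's cheaper quote is swamped by five years of labor and incident cost. The point isn't the specific numbers; it's that the recurring terms dominate once N gets large.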

Buying purely on price is how you end up with the network equivalent of a €150 phone that dies at noon and shatters when you sneeze.

Enter FMEA: Systematic Risk Thinking

TCO is only half the story. The other half is risk.

This is where we borrow from FMEA (Failure Modes and Effects Analysis). You don’t need the full aerospace-level machinery, but the mindset is gold:

“For this technology or architecture, how can it fail? How bad is that? How likely is it? How easily can we detect it?”

Break it down:

  1. Identify failure modes

  2. Rate severity, likelihood, and detectability

  3. Plan mitigations

Let’s apply this to a router/firewall choice.

1. Identify Failure Modes

Some examples for a routing platform or firewall:

  • Control-plane scalability issues:

    • BGP session explosion under certain conditions.

    • Route reflector overload at higher route counts.

    • SPF storms in IGP due to bad design or bugs.

  • Firmware / software bugs:

    • Certain linecards or versions crash under specific traffic patterns.

    • ISSU/upgrade bugs causing disruptive restarts.

    • Memory leaks that accumulate over weeks.

  • Convergence edge cases:

    • ECMP or unequal cost behaviors causing micro-loops.

    • Graceful restart interactions with certain peers.

    • MRAI / timer interactions that prolong reconvergence.

  • Misconfiguration risk:

    • CLI commands that are dangerous and too easy to run.

    • No transactional commit; partial config changes applied immediately.

    • Poor guardrails around critical policies (e.g., default route exports).

  • Observability blind spots:

    • Missing counters for key paths.

    • No good way to trace data-plane decisions.

    • Limited error logging on critical events.

  • Integration risk:

    • New firewall that logs in a format your SIEM can’t parse.

    • Router that exports IPFIX in a non-standard way.

    • Controller that assumes all devices are from the same vendor.

2. Rate Severity × Likelihood × Detectability

You don’t need to obsess over numbers, but you can:

  • Assign Severity (S): 1–10

    • 10 = global outage, customer-visible, revenue impact.

    • 5 = localized outage or degraded performance.

    • 1–3 = minor annoyance.

  • Assign Likelihood (L): 1–10

    • 10 = happens often in similar environments.

    • 5 = seen occasionally under certain conditions.

    • 1 = extremely rare or theoretical.

  • Assign Detectability (D): 1–10

    • 10 = hard to detect; you find out when customers scream.

    • 5 = partially detectable via telemetry or alerts.

    • 1–2 = very easy to detect quickly.

Then you compute a rough Risk Priority Number (RPN):

RPN = S × L × D

For example:

  • A convergence bug that causes global route blackholes, seen in previous versions, and hard to detect until traffic is impacted, might be:

    • Severity 9, Likelihood 5, Detectability 8 → RPN 360.

  • A cosmetic logging bug that occasionally mislabels an internal event:

    • Severity 2, Likelihood 7, Detectability 2 → RPN 28.
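
A few lines of Python are enough to turn these judgments into a triage list. The failure modes and scores below are illustrative, echoing the two examples above:

```python
# Rough FMEA-style triage: score failure modes and sort by RPN = S * L * D.
# Descriptions and scores are illustrative judgments, not measurements.

failure_modes = [
    # (description, severity, likelihood, detectability)
    ("Convergence bug causing global blackholes", 9, 5, 8),
    ("ISSU bug causing disruptive restart",       7, 4, 4),
    ("Cosmetic logging mislabels internal event", 2, 7, 2),
]

ranked = sorted(failure_modes, key=lambda fm: fm[1] * fm[2] * fm[3], reverse=True)

for desc, s, lk, d in ranked:
    print(f"RPN {s * lk * d:>3}  (S={s}, L={lk}, D={d})  {desc}")
# RPN 360 for the convergence bug dominates: that's where mitigation effort goes first.
```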

You don’t have to worship the numbers, but they help you:

  • Focus on the high-risk failure modes.

  • Compare risks between options more objectively.

You can do this per vendor or per architecture.

For example, comparing Vendor A to Vendor B:

  • Vendor A:

    • More known high-severity bugs at scale, weaker telemetry, poor guardrails.

    • High S, moderate-to-high L, high D → multiple high RPN items.

  • Vendor B:

    • Fewer severe bugs, strong telemetry, safer config model.

    • Lower S (due to better containment), lower L, lower D → lower RPN overall.

Even if Vendor B is more expensive on paper, a lower aggregated RPN across critical failure modes translates into a real reduction in risk and cost over time.

3. Consider Mitigations

Once you know your hot spots, you ask:

“What mitigations reduce severity, likelihood, or detectability scores?”

Examples:

  • Lab validation:

    • Reproduce known failure modes in a realistic lab or staging environment.

    • Test upgrades, route churn, failovers, and high load scenarios.

  • Guardrails & automation:

    • Build automation that standardizes configurations and avoids dangerous commands.

    • Use templates to prevent human errors.

    • Enforce policy checks before changes (linting, static analysis, dry run).

  • Rollout strategy:

    • Start with a non-critical domain or region.

    • Use canary devices or tenants.

    • Gradually increase scope while monitoring metrics and customer impact.

  • Observability improvements:

    • Add synthetic checks, health dashboards, and anomaly detection.

    • Instrument critical paths with extra logging and metrics.

    • Integrate vendor telemetry into your own platforms.

  • Fallback / rollback plans:

    • Maintain a well-tested rollback procedure.

    • Keep dual control planes or dual firewalls during transition phases.

In the TCO + risk equation, mitigations cost time and money, but they reduce both expected outage cost and incident probability, which is exactly what you want.
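
A quick expected-value sketch makes that trade visible. The outage cost, probabilities, and mitigation price below are assumptions, not data:

```python
# Sketch of risk-adjusted cost: does a mitigation pay for itself?
# All figures below are assumptions for illustration.

outage_cost = 500_000            # assumed cost of one high-severity outage (USD)
p_no_mitigation = 0.40           # assumed annual outage probability without mitigation
p_with_mitigation = 0.10         # assumed probability after lab validation + canaries
mitigation_cost = 60_000         # assumed annual cost of the mitigation program

expected_no = p_no_mitigation * outage_cost                        # $200,000/year
expected_with = p_with_mitigation * outage_cost + mitigation_cost  # $110,000/year

print(f"Expected annual cost without mitigation: ${expected_no:,.0f}")
print(f"Expected annual cost with mitigation:    ${expected_with:,.0f}")
# The mitigation pays off whenever the drop in expected outage cost exceeds its price.
```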

Back to the Smartphone: Same Mistake, Bigger Impact

The smartphone analogy is still perfect here:

  • The cheapest phone might:

    • Have a small battery.

    • Use weak materials.

    • Run poorly maintained software.

  • The practical result:

    • You charge it twice a day.

    • You replace it when it cracks.

    • You lose time to crashes and glitches.

The up-front savings evaporate in hidden costs and friction.

In networks:

  • The cheapest box might:

    • Save you money on day one.

    • Cost you more over five years in power, space, and support.

    • Generate more outages and manual work than the “expensive” box.

At scale, that’s not just annoying. It’s existential:

  • SLA violations, customer churn, engineers resigning, stalled projects.

TCO modeling and FMEA-style risk thinking don’t make decisions for you. But they force you to look past the glossy quote and ask:

“What will this really cost us to own and operate, and how can it fail on us?”

In other words:

“Is this just a cheap phone that dies at noon and shatters on first drop, or is it something we can depend on for the long haul?”

Once you internalize that, “cheapest box wins” stops being an answer, and TCO + risk becomes part of every serious network design conversation you lead.

8. Trade Studies and Design Docs: Turning Comparisons into Decisions

By now, you’ve got all the pieces:

  • Requirements and constraints (Step 0).

  • A decision matrix to compare options.

  • A Pugh matrix to reason about deltas vs your current state.

  • TCO and risk modeling to go beyond “cheapest box wins.”

  • QFD-style thinking to map the needs to technical characteristics.

The next question is: Where does all this live?
Because if it only lives in your head or in a random spreadsheet, it doesn’t scale.

Serious engineering organizations capture all of this in a trade study or design document: 6-pager style, ADR style, call it what you want. The format is less important than the discipline:

For any meaningful architectural or technology choice, there should be a written narrative that explains what we decided, what we compared, and why we chose what we chose.

Let’s walk through how that looks with a concrete example.

A Concrete Theme: L3VPN over LDP vs L3VPN over SR-MPLS vs EVPN-based IPVPN

Imagine you’re designing a new provider core (or massively upgrading one). You need to choose the primary VPN architecture for your WAN:

  • Option 1 – L3VPN over LDP
    Classic MPLS L3VPN with IGP + LDP for label distribution, and BGP for VPN routes.

  • Option 2 – L3VPN over SR-MPLS
    Segment Routing over MPLS, with SR label stacks replacing LDP labels, SR-TE available for TE.

  • Option 3 – EVPN-based IPVPN
    EVPN used as the control plane for L3VPN/IPVPN services; potentially integrated with DC EVPN and DC–WAN interconnect.

You’re not going to pick one of these based on a meme. You’re going to write a trade study.

Here’s how that document might be structured.

1. Context and Current State

You start with the story of where you are today.

Example:

“Our current WAN core is based on IGP + LDP with L3VPN services. It was designed 10+ years ago for a much smaller customer base and simpler traffic patterns. Since then, traffic has grown 5–10x, and customers demand richer services, better convergence, and more predictable behavior under failure. We also have a growing EVPN-based DC environment that we eventually want to integrate more seamlessly with the WAN.

The existing network is stable but has known limitations:

  • Scaling pressure on LDP and IGP as we add more nodes and customers.

  • Operational complexity around RSVP-TE in specific parts of the network.

  • Limited ability to provide rich telemetry uniformly.

  • Architectural divergence between DC (EVPN) and WAN (L3VPN over LDP).”

The goal is to ground everything in real history and real pain.

2. Problem Statement and Requirements

Next, you formalize what you’re trying to solve, not just what tech you want to use.

Example:

“We need to define the target L3VPN/IPVPN architecture for a new and evolving core that:

  • Scales to N PEs, M customer VPNs, and up to X routes per VRF with headroom.

  • Provides fast, predictable convergence for typical failures (links, nodes, single-site).

  • Integrates cleanly with our EVPN-based DC fabrics and future DC–WAN interconnect.

  • Enables automation-first workflows for provisioning and changes.

  • Supports rich, model-driven telemetry for observability.

  • Lowers long-term OPEX by reducing protocol complexity and operational toil.”

You then list:

  • Goals / SLOs – capacity, convergence, availability, automation coverage.

  • Non-goals – e.g., “we are not designing an Internet-facing DDoS solution here.”

  • Constraints – existing hardware that must be reused, power/space, budget, timeline.

  • Assumptions – traffic growth rates, customer behaviors, and regulatory constraints.

This section is the anchor: every later argument should trace back here.

3. Options Considered (with Short Descriptions)

Now you introduce the candidates, neutrally.

Example:

  • Option 1 – L3VPN over LDP
    “Maintain the current L3VPN architecture based on IGP + LDP for the data plane and MP-BGP for VPN routing, with incremental improvements. This option leverages existing operational experience and avoids major control-plane changes.”

  • Option 2 – L3VPN over SR-MPLS
    “Introduce SR-MPLS as the label distribution mechanism, replacing or gradually reducing dependence on LDP. L3VPN remains the service model, with SR providing segment identifiers for paths, and a PCE optionally enabling SR-TE for traffic engineering.”

  • Option 3 – EVPN-based IPVPN
    “Adopt EVPN as the unified control plane for both L2 and L3 VPN services, using EVPN route types for IPVPN. This aligns WAN and DC control planes and can facilitate smoother DCI and multi-tenant designs across domains.”

You might mention briefly why other ideas were not considered further (e.g., SRv6 omitted for regulatory reasons or hardware limitations), but the deep “non-chosen” explanation comes later.

4. Decision Matrix + Pugh Comparison

Here, you bring in the multi-criteria decision matrix and Pugh matrix we’ve already discussed.

You define criteria (tied to requirements):

  • Functional fit (L3VPN needs, multi-tenancy, DC integration).

  • Scale (PE count, VRFs, FIB size, control-plane limits).

  • Convergence behavior (failover targets, deterministic behavior).

  • Operational complexity (protocol stack, runbook complexity, skill requirements).

  • Observability & telemetry (model-driven telemetry, BMP, IPFIX, etc.).

  • Automation friendliness (YANG models, gNMI, configuration semantics).

  • Vendor/community maturity (who runs this at scale already, operator experiences).

  • CAPEX / OPEX.

You assign weights and score each option. You don’t need to reproduce the whole matrix in the doc, but you:

  • Summarize the outcome.

  • Include the matrix as an appendix.

Example narrative:

“Based on our weighted criteria, SR-MPLS L3VPN (Option 2) scored highest overall, largely due to improved scalability, convergence behavior, and alignment with our automation goals. EVPN-based IPVPN (Option 3) scored close behind, with strong marks for DC–WAN integration but lower maturity scores in multi-vendor interoperability for our specific vendor mix. Maintaining L3VPN over LDP (Option 1) scored lowest overall, primarily due to scaling and operational complexity concerns.”
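
If you include the matrix as an appendix, it helps to make it reproducible. Here's a minimal sketch of the weighted scoring; the weights and 1–5 scores are placeholders, not our actual evaluation:

```python
# Minimal weighted decision matrix for the three VPN options.
# Weights (summing to 1.0) and 1-5 scores are illustrative placeholders.

criteria = {  # criterion: weight
    "functional_fit": 0.15, "scale": 0.20, "convergence": 0.15,
    "operational_complexity": 0.15, "observability": 0.10,
    "automation": 0.10, "maturity": 0.10, "capex_opex": 0.05,
}

scores = {  # option: {criterion: score, 1 (worst) to 5 (best)}
    "L3VPN over LDP":     {"functional_fit": 4, "scale": 2, "convergence": 2,
                           "operational_complexity": 2, "observability": 2,
                           "automation": 2, "maturity": 5, "capex_opex": 4},
    "L3VPN over SR-MPLS": {"functional_fit": 4, "scale": 4, "convergence": 4,
                           "operational_complexity": 4, "observability": 4,
                           "automation": 4, "maturity": 4, "capex_opex": 3},
    "EVPN-based IPVPN":   {"functional_fit": 4, "scale": 4, "convergence": 3,
                           "operational_complexity": 3, "observability": 4,
                           "automation": 4, "maturity": 2, "capex_opex": 2},
}

for option, s in scores.items():
    total = sum(criteria[c] * s[c] for c in criteria)
    print(f"{option:20s} weighted score: {total:.2f}")
# With these placeholder numbers: SR-MPLS 3.95, EVPN-IPVPN 3.40, LDP 2.70.
```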

Then you apply a Pugh matrix to compare each alternative to your current L3VPN-over-LDP baseline:

  • For each criterion: + / 0 / – vs baseline.

  • “SR-MPLS is + on scale and convergence, 0 on functional fit, – on migration effort, initially – on risk.”

  • “EVPN-IPVPN is + on integration with DC and multi-tenancy, 0 on basic L3VPN functionality, – on maturity in our multi-vendor scenario.”
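
The tally itself can be sketched the same way. The marks below simply mirror the two examples above; an unweighted net count is a sanity check against the weighted matrix, not a decision by itself:

```python
# Pugh-style comparison against the L3VPN-over-LDP baseline.
# '+' = better than baseline, '0' = same, '-' = worse; entries are illustrative.

pugh = {
    "SR-MPLS":    {"scale": "+", "convergence": "+", "functional_fit": "0",
                   "migration_effort": "-", "initial_risk": "-"},
    "EVPN-IPVPN": {"dc_integration": "+", "multi_tenancy": "+",
                   "basic_l3vpn": "0", "multi_vendor_maturity": "-"},
}

for option, marks in pugh.items():
    plus = sum(1 for m in marks.values() if m == "+")
    minus = sum(1 for m in marks.values() if m == "-")
    print(f"{option:12s} +{plus} / -{minus} vs baseline (net {plus - minus:+d})")
```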

This helps decision-makers see:

“We’re not choosing a toy; we’re trading real improvements against real migration cost and risk.”

5. TCO and Risk Analysis

Now you integrate money and risk.

You articulate TCO differences:

  • Option 1 – L3VPN over LDP:

    • CAPEX: lowest (reuse existing HW, minimal upgrades).

    • OPEX: highest, due to continued protocol complexity, limited automation, and incident history.

    • Long-term: likely to hit scaling walls and require more incremental duct tape.

  • Option 2 – L3VPN over SR-MPLS:

    • CAPEX: moderate–high (features/linecards/licenses for SR, potential hardware refresh).

    • OPEX: lower over time, due to simplified transport, fewer protocols, and better telemetry.

    • Long-term: better scaling and more automation-friendly.

  • Option 3 – EVPN-based IPVPN:

    • CAPEX: highest initially, due to potential control-plane redesign and required software/hardware updates.

    • OPEX: potentially lowest in the long term if DC–WAN integration and shared control plane reduce duplication.

    • Long-term: strong alignment with DC EVPN, but it depends heavily on vendor and ecosystem maturity.

Then you overlay risk using FMEA-style thinking:

  • Identify key failure modes for each option (control plane bugs, migration pitfalls, operational blind spots).

  • Qualitatively rank their severity and likelihood.

  • Explain mitigations (lab testing, staged rollout, automation guardrails).

Example narrative:

“Option 2 introduces short-term migration risk due to SR-MPLS deployment and coexistence with LDP, but mitigations include a phased rollout by region, extensive lab validation, and limiting initial deployments to non-critical traffic. Once fully deployed, the risk profile is lower than Option 1 due to reduced protocol complexity. Option 3 carries higher implementation risk due to less operational experience and potential multi-vendor EVPN interoperability issues in our environment.”

This is where leadership sees:

“Oh, this isn’t ‘we want the latest fancy tech,’ this is a thought-out trade between cost, operational burden, and risk over years.”

6. Recommendation with Rationale

At this point, you make the call.

Example:

“Recommendation: Adopt L3VPN over SR-MPLS (Option 2) as the target architecture for the new core.

We recommend SR-MPLS–based L3VPN for the following reasons:

  • It best meets our scale and convergence requirements while keeping operational complexity manageable.

  • It aligns with our automation strategy via well-supported YANG/gNMI models.

  • It improves observability by simplifying the transport layer and enabling cleaner telemetry around paths.

  • It provides a more straightforward evolutionary path from our current L3VPN over LDP network than a complete EVPN-IPVPN shift at this stage.

EVPN-based IPVPN (Option 3) remains attractive, particularly for tighter DC–WAN integration, but the maturity and multi-vendor concerns in our current environment make it more suitable as a potential future step once SR-MPLS is established.”

The key: your recommendation is traceable to everything above. It’s not “because I like SR-MPLS.” It’s “because, given our requirements and constraints, SR-MPLS is the best fit on balance.”

7. Non-Chosen Options and Why

This section is both politically and technically essential.

You explicitly document:

  • Why you’re not sticking with L3VPN over LDP:

    • “Scaling and operational limitations; higher long-term OPEX; doesn’t support our automation goals.”

  • Why you’re not choosing EVPN-based IPVPN now:

    • “Insufficient maturity in multi-vendor mode; increased migration risk; better timing once SR-MPLS is in place and DC–WAN strategy is more mature.”

This does two important things:

  1. It shows you took alternatives seriously; they weren’t dismissed out of bias.

  2. It prevents the endless “why didn’t we…” debates a year later.

Future you, and anyone new joining the org, can read this and understand: the non-chosen options weren’t forgotten; they were considered and parked for good reasons.

8. Migration / Rollout Plan

Finally, you move from decision to execution.

You sketch a high-level migration plan:

  • Phased rollout structure (by region, by POP, by customer segment).

  • Coexistence strategy:

    • LDP and SR side-by-side.

    • EVPN integration where needed.

  • Guardrails:

    • Strict change windows for initial deployments.

    • Rollback plans per step.

    • Enhanced monitoring and synthetic tests during rollout.

Example:

“We propose piloting SR-MPLS L3VPN in one secondary region with a limited subset of internal services. Following successful validation and incident-free operation for N weeks, we will expand to additional regions. LDP will remain enabled during an interim coexistence period, with clear criteria for decommissioning. Additional EVPN-IPVPN integration will be revisited once SR-MPLS is fully deployed.”

This makes clear:

  • You’re not just dropping a new architecture into production and hoping.

  • The decision includes a realistic, risk-managed path to get from “today” to “there.”

Why This Doc Matters: Audit Trail and Alignment

The most significant benefit of all of this is time travel.

Six months later, when leadership asks:

“Why did we pick SR-MPLS instead of just sticking with LDP or going straight to EVPN-IPVPN?”

You don’t shrug. You don’t say, “Because we like Vendor X,” or “Because it’s modern.”

You send them the trade study.

  • It shows the context and the pain you were solving.

  • It shows the options you considered and how you evaluated them.

  • It shows TCO and risk thinking.

  • It shows why non-chosen options were rejected at that time.

  • It shows a migration path tied to risk mitigation.

That document is:

  • An audit trail – your decision is explainable and defendable.

  • An alignment tool – cross-team stakeholders can see the whole reasoning.

  • A learning artifact – new engineers and future leaders can build on it.

This is what grown-up engineering looks like:
Not “I like this tech,” but “Here’s the problem, here’s the trade study, here’s the decision, and here’s how we execute safely.”

In the final article of this series, I will summarize everything. Stay tuned!

Leonardo Furtado
