Rebuild vs. Refactor vs. “Leave It Alone”: How to Decide
Mature products sometimes sit on aging architecture: frameworks past end‑of‑life, monoliths that resist change, brittle build systems, and a dependency tree older than your intern. Maintenance and development costs climb. Security risk grows. Yet a full rewrite threatens years of delay and opportunity cost: time you could spend building features customers will actually notice.
This post offers a practical way to decide when to rebuild, when to modernize in place, and when to leave well enough alone. It also compiles real case studies and data points, both successes and cautionary tales, to ground the decision in reality rather than dogma.
First principle: avoid the false binary
Rewriting from scratch is seductive. As Joel Spolsky put it, “the single worst strategic mistake that any software company can make” is committing to a full rewrite. (Joel on Software) He overstates to make a point, but the core warning stands: a rewrite throws away hard‑won bug fixes and institutional knowledge, and pauses feature delivery for months or years.
Equally risky is doing nothing. The “don’t touch it” approach accrues security, compliance, and staffing risk until a minor incident becomes an existential problem. Southwest Airlines learned this the hard way in December 2022: outdated crew‑scheduling software helped turn a winter storm into nearly 17,000 canceled flights, over $1B in losses, and a $140M federal penalty. (AP News)
Between those poles lies a spectrum of options. Gartner’s commonly referenced “7 Rs” of modernization (encapsulate, rehost, replatform, refactor, re‑architect, rebuild, replace) remind us there are many ways to change a system besides a big‑bang rewrite. (vFunction)
A pragmatic scorecard for your product
Use these questions to guide the decision and to make trade‑offs explicit with executives:
Security posture
Are you stuck on memory‑unsafe components where critical bugs keep recurring? Google and Microsoft have reported that ~70% of serious security bugs stem from memory safety issues in C/C++, a strong argument for moving critical components to memory‑safe languages when feasible. (Google Online Security Blog)
Android’s shift toward Rust correlates with memory‑safety bugs dropping from 76% to 24% of its vulnerabilities over six years, evidence that architectural language shifts can materially lower risk. (Google Online Security Blog)
Tech‑debt drag (financial)
McKinsey reports CIOs estimate 10–20% of budgets intended for new products are siphoned to tech‑debt remediation, and that tech debt equates to 20–40% of the value of the technology estate. If your numbers rhyme with these, modernization is already “taxed in.” (McKinsey & Company)
Delivery performance (DORA metrics)
Track deployment frequency, lead time for changes, change‑failure rate, and time to restore. Persistent “low performer” status signals architectural friction that incremental fixes aren’t solving. (Google Cloud)
Cost of delay vs. time to value
Use Weighted Shortest Job First (WSJF): prioritize the work with the highest Cost of Delay / Duration. This helps weigh a rewrite (long duration) against incremental refactors (shorter duration) and pure feature work. (Scaled Agile Framework)
Talent risk
If only a handful of retirees can maintain the core and hiring is impractical, you’re carrying operational risk. Many banks still run COBOL for core systems; an estimated 43% of banking systems and 95% of ATM swipes still rely on COBOL code. (BMC)
Regulatory/compliance isolation
Can you isolate regulated components (PCI, HIPAA) from the rest so they can evolve at different speeds? Etsy famously separated its PCI‑DSS environment to keep deploying 25–50 times a day elsewhere. (Thoughtworks)
Customer impact
What measurable user pain (latency, crashes, compatibility) results from the legacy stack? If the pain is systemic and user‑visible, refactoring or re‑architecture may pay back quickly.
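The WSJF comparison from the scorecard can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed method: the option names, cost-of-delay figures, and durations are invented placeholders, and `Option`, `wsjf`, and `rank` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    cost_of_delay: float   # placeholder units, e.g. value lost per week of delay
    duration_weeks: float  # estimated time to deliver


def wsjf(option: Option) -> float:
    """Weighted Shortest Job First score: Cost of Delay divided by Duration."""
    if option.duration_weeks <= 0:
        raise ValueError("duration must be positive")
    return option.cost_of_delay / option.duration_weeks


def rank(options: list[Option]) -> list[Option]:
    """Return options ordered from highest to lowest WSJF score."""
    return sorted(options, key=wsjf, reverse=True)


# Placeholder numbers for illustration only.
options = [
    Option("full rewrite", cost_of_delay=120, duration_weeks=78),
    Option("refactor billing module", cost_of_delay=40, duration_weeks=8),
    Option("new checkout feature", cost_of_delay=60, duration_weeks=6),
]

for opt in rank(options):
    print(f"{opt.name}: {wsjf(opt):.2f}")
```

With these made-up inputs the long rewrite ranks last even though its cost of delay is the largest, because its duration dominates the ratio. The point of the exercise is to put real estimates in front of executives, not to trust the sample numbers.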
When you should not rewrite: lessons from Netscape and Digg
Netscape (1998–2000): Netscape paused new development to rewrite its browser (Mozilla). The market didn’t pause: IE surged, and Netscape’s share collapsed. Spolsky’s canonical post blames the rewrite for surrendering the lead: “They decided to rewrite the code from scratch.” The rest is history. (Joel on Software)
Digg v4 (2010): A sweeping redesign/rebuild shipped with removed features and instability. U.S. traffic dropped 26–30% in a single month and kept falling. Users left. The migration strategy, not just the product decisions, amplified the damage. (Yahoo)
Takeaway: Big‑bang rewrites that freeze feature delivery and alienate users are strategically perilous unless you have runway and a clear, compelling payoff.
When incremental modernization wins
The Strangler‑Fig approach (wrap the legacy system, route some requests to new services, and gradually replace) is a proven pattern. Martin Fowler describes it as “the gradual replacement of a legacy system.” (martinfowler.com)
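To make the routing idea concrete, here is a minimal in-process sketch of a strangler façade. It assumes requests can be identified by a path prefix. The handler functions and the `StranglerFacade` class are hypothetical stand-ins, and a real deployment would usually do this in an HTTP proxy or gateway rather than in application code.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_app(path: str) -> str:
    """Stand-in for the existing system (hypothetical)."""
    return f"legacy handled {path}"


def new_orders_service(path: str) -> str:
    """Stand-in for a newly built service (hypothetical)."""
    return f"new service handled {path}"


class StranglerFacade:
    """Routes a request to a migrated handler when one exists, else to legacy."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        """Send all paths starting with `prefix` to the new handler from now on."""
        self._migrated[prefix] = handler

    def rollback(self, prefix: str) -> None:
        """Return a prefix to the legacy system if the new path misbehaves."""
        self._migrated.pop(prefix, None)

    def handle(self, path: str) -> str:
        # Longest matching prefix wins, so a narrow route can be carved out
        # independently of a broader one.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self._migrated[prefix](path)
        return self._legacy(path)


facade = StranglerFacade(legacy_app)
facade.migrate("/orders", new_orders_service)

print(facade.handle("/orders/42"))   # served by the new service
print(facade.handle("/invoices/7"))  # still served by legacy
```

Callers only ever talk to the façade, so each slice can move to the new system, or back again via `rollback`, without a coordinated cutover.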
Shopify: Rather than fracturing early into microservices, Shopify invested in a modular monolith (Rails) and tooling like Packwerk to enforce boundaries. This sustained developer productivity at scale without the operational overhead of premature microservices. (Shopify)
Etsy: Known for 50+ deploys/day, Etsy demonstrates that a well‑tended monolith, CI/CD, and strong observability can deliver velocity and safety, no rewrite required. (InfoQ)
Amazon (API mandate): As recounted by engineer Steve Yegge, Jeff Bezos mandated service interfaces between teams, forcing an internal SOA that set the stage for AWS, all without stopping the world for a rewrite. (Courses at Washington)
eBay (2000s): eBay evolved its platform to Java and built a DAL to scale horizontally, supporting billions of daily page views: a multi‑year re‑architecture rather than a “burn it down” rewrite. (Cornell CS)
Tactics that make incremental change work
Strangler Fig at the edges (HTTP façade/proxy). (Microsoft Learn)
Branch by Abstraction deep inside the codebase to swap implementations while shipping. (martinfowler.com)
Monolith‑first discipline: start or stay monolithic until boundaries are proven. (martinfowler.com)
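Branch by Abstraction can be sketched as follows. This is a simplified illustration: the tax-calculation domain, the 20% rate, and every class and function name are invented for the example. The old and new implementations sit behind one interface, and a flag decides which one callers receive.

```python
from abc import ABC, abstractmethod


class TaxCalculator(ABC):
    """The abstraction that callers depend on instead of a concrete class."""

    @abstractmethod
    def tax_cents(self, amount_cents: int) -> int:
        ...


class LegacyTaxCalculator(TaxCalculator):
    """Existing implementation, left untouched behind the abstraction."""

    def tax_cents(self, amount_cents: int) -> int:
        return amount_cents * 20 // 100  # placeholder 20% rate


class NewTaxCalculator(TaxCalculator):
    """Replacement built alongside the old one while the product keeps shipping."""

    RATE_BASIS_POINTS = 2000  # the same placeholder rate, expressed as 20.00%

    def tax_cents(self, amount_cents: int) -> int:
        return amount_cents * self.RATE_BASIS_POINTS // 10_000


def make_calculator(use_new: bool) -> TaxCalculator:
    """Single switch point; in practice this could read a feature flag."""
    return NewTaxCalculator() if use_new else LegacyTaxCalculator()


def invoice_total(amount_cents: int, calculator: TaxCalculator) -> int:
    """Caller code knows only the abstraction, never which implementation runs."""
    return amount_cents + calculator.tax_cents(amount_cents)


# Both implementations can run side by side to check parity before the switch.
for use_new in (False, True):
    print(use_new, invoice_total(1000, make_calculator(use_new)))
```

Once the new implementation has proven itself in production, the legacy class and the flag can be deleted, leaving the abstraction in place or inlining it.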
When a targeted rebuild pays off
Sometimes the architecture is the product. When core non‑functionals (speed, memory, offline, platform compatibility) determine user value, rebuilding parts, carefully, can transform outcomes.
Slack (2019): Slack rebuilt its desktop app internals to improve performance: 33% faster launch and up to 50% less memory, while improving multi‑workspace stability. Users noticed; the app felt snappier across the board. (Slack)
Snapchat Android (2019): Snap shipped a re‑engineered Android build (20% faster, 25% smaller) that reversed stagnating growth; DAUs ticked up after rollout. (TechCrunch)
Firefox Quantum (2017): Mozilla overhauled core engine components (some in Rust), yielding ~2× speedups over earlier Firefox releases and ~30% less memory than Chrome at the time. A strategic rewrite of hot paths, not the whole browser. (Mozilla Blog)
Microsoft Edge (2018): Microsoft re‑based Edge on Chromium to fix extension compatibility and web‑compat issues, trading engine purity for user and developer value. (Windows Blog)
Facebook (2012): Mark Zuckerberg admitted, “The biggest mistake we made as a company was betting too much on HTML5 instead of native… We burned two years.” Migrating to native mobile clients improved performance and UX. (The Verge)
PayPal (2013–2014): Rebuilt a critical app in Node.js: built twice as fast, 33% fewer lines, 40% fewer files, with 2× requests/sec and 35% lower response time vs. the Java version. (Medium)
Takeaway: Rebuilds shine when you have: (a) a contained surface area, (b) direct, measurable user‑visible gains, and (c) the ability to ship incrementally or swap in the new engine behind a façade.
When it’s rational to leave it alone (for now)
Some legacy systems are stable, deeply integrated, and mission‑critical. Killing them introduces more risk than reward, at least in the short term.
COBOL cores in banking: Despite their age, COBOL systems power a large chunk of global banking, including 43% of banking systems and 95% of ATM swipes. During COVID‑19, states even sought COBOL talent to stabilize unemployment systems. The rational path for many is encapsulation and API exposure, not wholesale replacement. (BMC)
Encapsulate mainframe logic with APIs: Tools like IBM z/OS Connect expose COBOL/CICS transactions as REST APIs, letting you build new experiences without rewriting the core. This is the Strangler in practice for mainframes. (IBM)
Leave it alone when:
It’s stable, audited, and compliant.
You can wrap, monitor, and limit blast radius.
The Cost of Delay for new revenue features dwarfs the modernization benefits. (Use WSJF to make that explicit.) (Scaled Agile Framework)
When doing nothing becomes an unacceptable risk
Security and patch hygiene: Equifax’s 2017 breach resulted from an unpatched Apache Struts flaw; total liabilities exceeded $1.3B and the reputational hit was massive. Legacy or not, inability to patch quickly is a red flag. (Oversight Committee)
Operational fragility: Southwest’s “legacy” scheduling stack failed under stress, triggering cascading cancellations and federal penalties. If your incident postmortems repeatedly identify architectural bottlenecks or manual workarounds, waiting is expensive. (Wikipedia)
A step‑by‑step playbook
1) Measure and baseline.
Adopt DORA metrics; instrument change‑failure rate and mean time to restore (MTTR). Quantify tech‑debt spend vs. feature spend. This gives you a before/after and supports executive alignment. (Google Cloud)
2) Identify “kill switches.”
Create explicit triggers for modernization (e.g., vendor EOL; inability to patch within 7 days; change‑failure rate or MTTR beyond agreed thresholds; critical talent attrition).
3) Prototype modernization safely.
At the edges: Strangler Fig with a routing façade. (Microsoft Learn)
In the core: Branch by Abstraction to swap internal implementations while shipping. (martinfowler.com)
4) Run a 6–12 week spike.
Build a small, production‑bound slice of the new approach. Compare outcomes (latency, error budget, dev time) to the legacy path. Use WSJF to prioritize next slices. (Scaled Agile Framework)
5) Choose your path:
Rebuild when:
Security posture demands a language/runtime shift (e.g., moving to memory‑safe components). (Microsoft Security Response Center)
User‑visible performance or compatibility is blocked by current architecture (Slack, Firefox, Edge). (The Verge)
Talent risk is existential.
Modernize in place when:
You can isolate and replace components iteratively (Shopify/Etsy/Amazon). (Shopify)
The core domain is stable, but supporting tech (build, deploy, observability) lags. Invest in platform engineering first.
Leave it (for now) when:
The system is reliable, contained, and easily wrapped with APIs (typical mainframe cores). Plan for data replication, event streams, and strangled replacements over time. (IBM)
6) Communicate with quotes and numbers.
Executives respond to crisp statements like Zuckerberg’s candid admission about HTML5 (“We burned two years”) and to concrete deltas such as Slack’s 33% faster launch. Anchor your plan in numbers, not vibes. (The Verge)
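Steps 1 and 2 of the playbook (baseline the metrics, then define triggers) could be wired together roughly like this. Treat it as a sketch under assumptions: the `Deployment` record shape is hypothetical, time to restore is simplified to "deployment time to restoration time", and the change-failure-rate and MTTR thresholds are placeholders. Only the 7-day patch window comes from the example in step 2.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only when a failure was resolved


def change_failure_rate(deploys: List[Deployment]) -> float:
    """Fraction of deployments that caused a failure in production."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)


def mean_time_to_restore(deploys: List[Deployment]) -> Optional[timedelta]:
    """Average time from a failed deployment to restoration, or None if no data."""
    durations = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in deploys
        if d.failed and d.restored_at is not None
    ]
    return timedelta(seconds=mean(durations)) if durations else None


def tripped_triggers(
    deploys: List[Deployment],
    days_to_patch: int,
    max_cfr: float = 0.15,                      # placeholder threshold
    max_mttr: timedelta = timedelta(hours=24),  # placeholder threshold
    max_patch_days: int = 7,                    # the 7-day example from step 2
) -> List[str]:
    """Return the names of the modernization triggers that have fired."""
    tripped = []
    if change_failure_rate(deploys) > max_cfr:
        tripped.append("change-failure rate")
    mttr = mean_time_to_restore(deploys)
    if mttr is not None and mttr > max_mttr:
        tripped.append("time to restore")
    if days_to_patch > max_patch_days:
        tripped.append("patch latency")
    return tripped


# Invented sample data: four deployments, two of which failed.
start = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deployment(start),
    Deployment(start + timedelta(days=1), failed=True,
               restored_at=start + timedelta(days=1, hours=2)),
    Deployment(start + timedelta(days=2)),
    Deployment(start + timedelta(days=3), failed=True,
               restored_at=start + timedelta(days=3, hours=4)),
]

print(tripped_triggers(deploys, days_to_patch=10))
```

Writing the thresholds down as code (or config) before an incident forces the argument about "how bad is too bad" to happen in calm conditions, and gives the numbers from step 6 a single source.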
Case‑study quick hits (for your deck)
Do not freeze features for years unless the payoff is existential. Netscape’s rewrite helped forfeit the browser war. (Joel on Software)
Small, surgical rebuilds can delight users. Slack desktop, Snapchat Android, Firefox Quantum each earned material performance wins that users felt. (The Verge)
You can scale a monolith (for a long time). Etsy and Shopify prove it with disciplined modularity and tooling. (Etsy)
APIs keep legacy valuable. Amazon’s internal SOA, and mainframe API layers, let old cores power new businesses. (Courses at Washington)
Neglect is costly. Southwest’s meltdown and Equifax’s patch failure are reminders that “leave it alone” is not “ignore it.” (Department of Transportation)
Final thought
Martin Fowler’s advice to start monolith‑first and evolve boundaries is still wise. Microservices, rewrites, and new languages are means, not ends. Choose the option that maximizes time‑to‑value and risk reduction for your context, measure it with DORA and WSJF, and ship it incrementally via Strangler and Branch by Abstraction.
If your product’s core is stable and compliant, encapsulate and invest around it. If architecture blocks your users or your security team, rebuild specific parts with clear ROI. And if your instincts say “rewrite everything,” remember Spolsky’s warning, and make sure your runway, and your users, won’t run out before you land. (martinfowler.com)


