How Much Downtime Is Acceptable? (GKE Perspective)

TL;DR

  • "Zero Downtime" is a vanity metric, not a business requirement.
  • 99.999% uptime allows for only about 5 minutes of downtime per year.
  • GKE Reality: Zonal clusters (99.5% SLA) are significantly cheaper than Regional ones (99.95% SLA), and often sufficient for anyone who isn't a global bank.
  • The Strategy: Use Zonal compute to save cash, but keep a Regional control plane or robust DR plan.
  • Most companies can afford 24 hours of downtime, and an outage that long is unlikely to happen anyway.

Table of contents

  1. Google Kubernetes Engine (GKE): Zonal vs. Regional
  2. Recommendation

We cannot afford any downtime!

Until you show them the math and the bill!

Unless you are a stock exchange, Netflix, or a global bank, I can state with confidence: you probably can.

"Zero Downtime" is the most expensive requirement in software engineering.

Does your Service Level Agreement (SLA) actually promise 99.999% uptime?
Or did your CEO just say "I want it up always"?

I’ve asked many stakeholders: "How much time can we afford to be offline?"
100% of them replied: "None!"

But let’s look at the math.

Here is the difference between 98% uptime and the mythical "five nines" (99.999%):

Availability | Downtime per Year | Downtime per Month | Downtime per Week
98%          | 7d 7h 18m 59s     | 14h 36m 35s        | 3h 21m 36s
99%          | 3d 15h 39m 30s    | 7h 18m 17s         | 1h 40m 48s
99.9%        | 8h 45m 57s        | 43m 50s            | 10m 4.8s
99.99%       | 52m 36s           | 4m 23s             | 1m 0.48s
99.999%      | 5m 16s            | 26s                | 6s
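
If you want to check these numbers yourself, here is a minimal Python sketch of the arithmetic. It assumes an average year of 365.2425 days, which is the assumption that reproduces the yearly column; the function names are my own, purely for illustration.

```python
SECONDS_PER_YEAR = 365.2425 * 24 * 3600  # average Gregorian year (assumption)


def downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Seconds of downtime allowed in a period at a given availability."""
    return (1 - availability_percent / 100) * period_seconds


def human(seconds: float) -> str:
    """Format seconds as e.g. '3d 15h 39m 30s', rounded to whole seconds."""
    total = round(seconds)
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    parts = [(days, "d"), (hours, "h"), (minutes, "m"), (secs, "s")]
    # Skip zero-valued units; fall back to "0s" if everything is zero.
    return " ".join(f"{value}{unit}" for value, unit in parts if value) or "0s"


for pct in (98, 99, 99.9, 99.99, 99.999):
    print(f"{pct}%: {human(downtime_seconds(pct, SECONDS_PER_YEAR))} per year")
```

Pass a different period (for example 7 * 86400 seconds for a week) to get the other columns, keeping in mind that this sketch rounds to whole seconds.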

Most SaaS products can tolerate more downtime than their CEOs think. Here is when a single zonal GKE cluster is enough and when you actually need regional.

If your SaaS makes 1,000 USD a day and you have 12 hours of downtime, that is 500 USD lost in revenue.

Does it make sense to spend thousands per year to save that 500?
I'd argue it does not.

When you are making 100k USD per day, well, that is a different story.
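
The same break-even reasoning as a rough Python sketch. It assumes revenue is spread evenly over the day, and the 5,000 USD yearly cost of the more available setup is a hypothetical figure standing in for "thousands per year"; plug in your own numbers.

```python
def revenue_lost(daily_revenue_usd: float, downtime_hours: float) -> float:
    """Revenue lost to an outage, assuming sales are spread evenly over 24 hours."""
    return daily_revenue_usd * downtime_hours / 24


def worth_the_spend(extra_cost_per_year_usd: float,
                    daily_revenue_usd: float,
                    downtime_hours_avoided_per_year: float) -> bool:
    """True if the revenue protected in a year exceeds the extra yearly cost."""
    protected = revenue_lost(daily_revenue_usd, downtime_hours_avoided_per_year)
    return protected > extra_cost_per_year_usd


EXTRA_COST = 5_000  # hypothetical yearly cost of the more available setup

for daily in (1_000, 100_000):
    loss = revenue_lost(daily, 12)
    verdict = "worth it" if worth_the_spend(EXTRA_COST, daily, 12) else "not worth it"
    print(f"{daily:>7,} USD/day: 12h outage costs {loss:,.0f} USD -> {verdict}")
```

This ignores reputational damage and contractual penalties, which can matter more than the lost sales themselves, so treat it as a starting point for the conversation rather than the final answer.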


Google Kubernetes Engine (GKE): Zonal vs. Regional

For reference, the GKE SLA is 99.5% for Zonal clusters versus 99.95% for Regional clusters.

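Opting to go Regional raises your bill. The cluster management fee is the same flat rate for both types, but the GKE free tier credit covers only Zonal and Autopilot clusters, and a Regional cluster replicates each node pool across three zones by default, which multiplies your compute.

It also introduces cross-zone network egress charges and complexity in storage replication.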
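
To put the two SLA figures in perspective, here is a small Python sketch of the worst case each one permits per year, and what that would cost the 1,000 USD per day SaaS from the earlier example. It assumes a 365-day year, evenly spread revenue, and that every hour of SLA downtime is an hour of lost sales. That last assumption is pessimistic: the SLA is measured on the control plane, and workloads that are already running usually keep serving while it is unavailable.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def sla_floor_exposure(sla_percent: float, daily_revenue_usd: float) -> tuple[float, float]:
    """Return (downtime hours per year, revenue at risk in USD) if availability
    sits exactly at the SLA figure. Assumes evenly spread revenue."""
    hours = (1 - sla_percent / 100) * HOURS_PER_YEAR
    return hours, daily_revenue_usd * hours / 24


DAILY_REVENUE = 1_000  # USD per day, the small-SaaS example from earlier

for name, sla in (("Zonal", 99.5), ("Regional", 99.95)):
    hours, usd = sla_floor_exposure(sla, DAILY_REVENUE)
    print(f"{name:<8} {sla}%: up to {hours:.1f} h/year, about {usd:,.0f} USD at risk")
```

At this revenue level the gap between the two tiers is roughly 1,640 USD a year, which is the number to hold against the extra cost of going Regional.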

When choosing between a Regional and a Zonal GKE cluster, ask yourself (or your stakeholders): "If we are offline for 24 hours, do we go bankrupt?"

If the answer is "No, it's just annoying", then you can safely choose a Zonal architecture.

As long as you have a Disaster Recovery (DR) plan in place (infrastructure defined in Terraform, plus backups that are stored regionally so they survive a zone outage), you will most likely be fine.

Sadly, cloud providers in most cases present multi-region as a “best practice”, but in reality it is additional revenue for them, and for lean startups it is an “expensive practice”.

How to Decide Your Uptime Target?

The rule is simple: if the system goes down at 3:00 AM on a Saturday, who needs to wake up?

  • A: The on-call engineer must fix it immediately. Aim for 99.9% (you need Regional).
  • B: It can wait until Monday morning. Aim for 99.5% (Zonal is perfectly fine).
  • C: We need it up, but we can't afford an on-call engineer. Target 99.5% (use Zonal plus auto-healing mechanisms).
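
If it helps to keep this rule next to your infrastructure code, the three answers can be written down as a tiny lookup. This is only an illustrative Python sketch: the enum and function names are made up, and the targets are exactly the ones from the list above.

```python
from enum import Enum


class Saturday3am(Enum):
    """Who reacts if the system goes down at 3:00 AM on a Saturday?"""
    FIX_NOW = "A"          # the on-call engineer must fix it immediately
    WAIT_FOR_MONDAY = "B"  # it can wait until Monday morning
    NO_ON_CALL = "C"       # must be up, but no on-call engineer is affordable


# Mapping taken from the three answers above: (target uptime %, architecture).
TARGETS = {
    Saturday3am.FIX_NOW: (99.9, "Regional"),
    Saturday3am.WAIT_FOR_MONDAY: (99.5, "Zonal"),
    Saturday3am.NO_ON_CALL: (99.5, "Zonal + auto-healing"),
}


def uptime_target(answer: Saturday3am) -> tuple[float, str]:
    """Look up the suggested uptime target and cluster type for an answer."""
    return TARGETS[answer]


for answer in Saturday3am:
    pct, arch = uptime_target(answer)
    print(f"{answer.value}: target {pct}% -> {arch}")
```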

Recommendation

  • You likely don't need Regional (multi-zone) compute. Compute eats up most of the cost.
  • Focus on the Control Plane availability. Ensure you have resources to address an outage.
  • Go Zonal to save money. Reinvest those savings into better observability or backups.
