Google Cloud region currently down due to water intrusion

dang · on April 27, 2023

All: please don't post low-effort comments that merely react to the first association you have. We're trying for curious conversation here, which is something else.

https://news.ycombinator.com/newsguidelines.html

jacquesm · on April 27, 2023

This seems to significantly under-report what's going on, see:

https://www.theregister.com/2023/04/26/google_cloud_outage/

There is mention of a fire as well.

madaxe_again · on April 27, 2023

Oh, the irony.

A few years ago I implemented a top to bottom ISO27k1 ISMS for a client handling extremely sensitive and mission-critical data for industry.

One risk I recommended controls for was that of a fire and/or flood at their primary datacentre for their client-facing offerings - this datacentre. I’ve experienced the misery of a datacentre oops myself, firsthand, twice, and it’s a genuine risk that has to be mitigated.

At my insistence, I had them burn hundreds of man-hours ensuring that they could failover to a new environment in a different datacentre with a bare minimum of fuss, as what I arrived to was an all the eggs in one basket situation. It took a fair bit of re-engineering of how deployments worked, how data was replicated, how the environment was configured - but they got there, and the ISMS was put into operation, and was audited cleanly by a reputable auditor, and everyone lived happily ever after.

Except… they were acquired by private equity. Who had no truck with all of this costly prancing about with consultants and systems. Risk register? Why do we need this? What value does it add today? ISO27k1? Don’t be silly. We have that certificate. You don’t need it. Dev team, ops team, leadership — almost everyone — ejected and replaced with a few support staff.

I see their sites are down.

jacquesm · on April 27, 2023

There's that beautiful German word again... schadenfreude. I have had similar discussions multiple times in the last year and the magic thinking around the cloud is so strong that it is sometimes impossible to get through. The fact that cloud stuff can go down and that in the end it is your data and no amount of cloud credits are going to help you if your data is lost seems to be utterly beyond some people's comprehension.

madaxe_again · on April 28, 2023

In this case it wasn’t so much “the cloud is invincible!” as “this appears to be a cost centre” - their whole shtick is to boost profitability in the short term by gutting businesses, and then selling them onwards to some other finance sucker.

I am sure that my name is currently being damned in a boardroom in Chicago, as as the person who warned of this, I am likely seen as responsible.

jacquesm · on April 28, 2023

> I am sure that my name is currently being damned in a boardroom in Chicago, as as the person who warned of this, I am likely seen as responsible.

In that sense nothing ever changes. Anyway, kudos to you for staying the course on topics like these.

illumin8 · on April 29, 2023

This really has nothing to do with cloud and is more of an "all eggs in one basket" problem. I wish people would stop painting cloud itself as less capable.

The fact is, most cloud providers offer multiple regions, which have the capability of giving you more geographic redundancy than most companies that operate in their own datacenters have.

Whether you choose to adopt a multi-region or multi-datacenter architecture is really orthogonal to whether you choose cloud or on-prem.

DebtDeflation · on April 27, 2023

Plot twist: the server racks were made out of sodium.

H8crilA · on April 27, 2023

You're not far off: the batteries are (probably) made of lithium.

Also, why batteries in a datacenter? When you implement a flush() command at the lowest level you're faced with two choices: 1) actually write to disk, then return from the call, 2) write to some cache/RAM and have just enough battery locally to ensure that you can write it to disk even if all power goes out.

Then there's the other problem of surviving long enough between a power interruption and diesel generators starting up. But this is a smaller problem, rebooting all instances in a datacenter is less bad than losing some data that was correctly flush()ed by software. Bad flush() behaviour can result in errors that cannot be recovered from without a complicated manual intervention (for example if it causes corrupted and unreadable database files).

hoofhearted · on April 27, 2023

The batteries in the datacenter are simply there to hold the power until the generators are all up and running, and the phases are in sync.

They create 3 separate arrays of batteries in each back. Each array represents a power phase, A-B-C.. if I remember correctly, each array has a number of low voltage/2000 amp batteries connected in series to make up for a 2000amp 480 volt leg on the other end.

In a tier 4 plus+1 datacenter, they have 4 battery rooms and 4 generators for each data pod. You have a primary generator and UPS battery set, and a backup generator set for each pod. And then that generator set has its own primary and secondary backup set. The end result is that they can work on any piece of equipment without interrupting power. In the event they lost the primary set or needed to take it offline for maintenance, they have the whole secondary redundant set to fallback on.

The servers on the received on the power cord after it passes the switchgear never know that there has been power source changes on the other end.

hoofhearted · on April 28, 2023

I wish I understood why my apple autocorrect stopped working lol

walrus01 · on April 27, 2023

> Also, why batteries in a datacenter?

Everything serious in the telecom/ISP infrastructure sector has a big -48VDC battery plant, or preferably separate A and B side -48VDC battery plants, to provide a significant buffer between power going Grid --> AC-to-DC Rectifiers --> Equipment, and when a generator can start up, warm up, and transfer switch does its job.

Even if a bunch of servers don't have any UPS or battery backup because they're designed to tolerate individual node (or whole rack, or whole row failures) the core network equipment in a datacenter will still have a huge battery plant.

Ideally if you have a chilled water loop for cooling you do not want it anywhere near your big-ass racks of batteries. Or near the racks that contain the rectifiers and DC breakers, distribution bus bars.

If you look at the battery racks in a traditional telco CO in the US for instance you will see that all of the cabling and batteries are a minimum of 1 foot off the floor, so that the whole place could theoretically flood and the DC distribution would remain unaffected. Same principle that applies to very traditional setups with wet-cell 2V lead acid batteries also applies to more modern things if building from scratch.

eep_social · on April 27, 2023

Very different trade offs in play for google who run with a relatively high tolerance for failure at the individual machine or even rack level. At one point I believe there were batteries in every rack, though I don’t know what they're building these days. A telco DC is gonna have more network interconnect with lower tolerance for failure due to capacity impact that isn’t easy to double.

Think like a fiber termination demarc vs an in-cluster mesh.

manquer · on April 27, 2023

Google cloud cannot run high tolerance failures . Google the product wouldn’t notice region or zone going down , google cloud customers will .

walrus01 · on April 27, 2023

What I was saying above is that the 'core' of a google DC has a massive amount of network interconnect and needs for battery backup not very different from a big IX point or traditional "primary CO" for a city in a telco environment.

By square footage maybe 95% of a google DC might have no UPS or battery backup but the core network for things like routers and DWDM transport equipment absolutely will have such.

If they were unlucky enough that the burst cooling loop met with the battery plant for the core gear in a building or small campus of buildings....

ironmagma · on April 27, 2023

NaCl, the revolutionary Sodium Cloud technology.

pcurve · on April 27, 2023

The outage has been going on for 40+ hours now...

I think this is sort of big.

jonatron · on April 27, 2023

This doesn't sound as bad as OVH's 2021 fire.

nik736 · on April 27, 2023

Well, we had pictures very quickly of the OVH fire. Google seems to be not very transparent on what is exactly happening...

stingraycharles · on April 27, 2023

The linked article says that there was a leak in a water cooling system, which in turn ended up in the battery system which caused a fire. But yeah it’s not coming from Google but second hand reports.

mike_hearn · on April 28, 2023

It's not second hand, it's from the colo provider they're using in Paris.

sschueller · on April 27, 2023

If you trench a fire in water in a DC it might be just as bad.

jacquesm · on April 27, 2023

I wouldn't draw any conclusions just yet.

t0mas88 · on April 27, 2023

I can't ignore the feeling that Google Cloud is sub par compared to AWS. How did this again cause a multi zone failure. Why haven't they fixed those dependencies the last few times they had a full region failure.

dekhn · on April 27, 2023

zones and regions have different definition in google cloud than AWS. Multiple zones are physically co-located and are not truly availability zones because the physical proximity causes shared fates even when they have independent systems (network, power) that should allow one to fail while another doesn't. Even two datacenters in the same city are prey to the same meteor.

deathanatos · on April 27, 2023

> A cluster represents a distinct physical infrastructure that is housed in a data center.

> Google designs zones to minimize the risk of correlated failures caused by physical infrastructure outages

And they have stated that the flood "caused a multi-cluster failure".

> Zones should be considered a single failure domain within a region.

(—GCP's documentation.)

illumin8 · on April 29, 2023

I guess you just discovered the difference between a GCP Zone and an AWS Availability Zone.

By definition, AWS availability zones do not share fault domains, other than a geographic region up to hundreds of miles wide. Even for services used by multiple AZs, such as transit to the Internet and other regions, there are two transit centers operating in separate fault domains.

In contrast, many GCP Zones share the same physical datacenter. What they are actually providing you are simply different racks, rows of racks, or rooms in a single physical facility. Caveat emptor.

sitkack · on April 28, 2023

"should be considered" -> we would like you to think

Zones are an artificial construct.

ericpauley · on April 27, 2023

Wow, this is a big issue! Is there any way to guarantee AWS-level physical redundancy in GCP without paying the latency/inter-region data transfer (higher than inter-zone outside US: https://cloud.google.com/vpc/network-pricing#egress-within-g...) pricing?

A nice thing about EC2 is that you're getting a pretty dumb, predictable service. There have been multi-zone or global control plane issues but the physical metal has bona fide redundancy between zones/regions.

lamontcg · on April 27, 2023

I don't know what it is like these days but us-east AZs used to be in different datacenters that were on different flood plains and power companies. They were just on a very high capacity (for the day) fiber ring. You still had a couple miles of light-delay latency in between them. A sufficiently big enough meteor, or a powerful enough massive hurricane, could probably take out multiple ones at the same time.

londons_explore · on April 27, 2023

I believe the Google design is one big pool of machines, perhaps spread across a few buildings, but they hope that any failure only affects a few racks.

They will arrange/move workloads such that any one customer will only see an outage in one 'zone'.

Clearly that didn't work here,

local_crmdgeon · on April 27, 2023

That's .... that's not a great strategy. I get that it's cheap, but incidents like this will cost a LOT more with lost business.

styren · on April 27, 2023

In the same building is quite a bit different than 10km apart, even if a huge meteor would lead to the same conclusion for both scenarios.

secondcoming · on April 27, 2023

Then why do they charge extra for cross-zonal traffic?

kccqzy · on April 27, 2023

Presumably for inter-zonal traffic, it uses up bandwidth on their B4 network, but intra-zonal traffic does not and has basically unlimited bandwidth.

B4: https://storage.googleapis.com/pub-tools-public-publication-...

dilyevsky · on April 27, 2023

This is not going through b4 but inter-zonal links are still choke points because they are separate networks

ec109685 · on April 27, 2023

It was not only a region outage, but a global outage!

"GCE Global Control Plane: Experienced a global outage, which has been mitigated. Primary impact was observed from 2023-04-25 23:15:20 PDT to 2023-04-26 03:45:30 PDT and impacted customers utilizing Global DNS (gDNS). A secondary global impact for aggregated list operation failures for customers with resources in europe-west9 has also been mitigated. Please see migration guide for gDNS to Zonal DNS for more information: https://cloud.google.com/compute/docs/internal-dns#migrating... "

Embarrassing.

influx · on April 27, 2023

AWS has had datacenters that were flooded, but they failed over and physically moved racks via trucks. Customers never knew.

local_crmdgeon · on April 27, 2023

This has happened much more than people know.

Aeolun · on April 28, 2023

If it happens without me knowing then they’re doing their job very well.

1letterunixname · on April 28, 2023

Well, GCP did fire most of their senior customer techs 18 months ago.

Alphabet is most ruled by arbitrary fiat of empire-builders.

Killing products and footgun changes are standard.

nonethewiser · on April 27, 2023

> I can't ignore the feeling that Google Cloud is sub par compared to AWS.

That goes without saying at this point. More importantly, it’s proven worse than Azure.

outworlder · on April 27, 2023

Depending on the metric. I have not seen any incidents where the isolation between customers was breach. Azure had several. Their compute offerings are better. We could go on.

On the other hand, Azure was(and still is) upfront about not having AZs - now that they have rolled out, hopefully those are not in the same building.

radicaldreamer · on April 27, 2023

Not sure what kind of fire there was there, but once those automatic sprinkler systems get going, they are very difficult to stop.

Someone in my freshman college dorm decided to use one as a clothes hanger hook and broke the thermometer in there. The sprinkler damaged the entire floor with water and the floor below had spotty rain as well.

The fire department came and was mainly concerned about evacuating everyone rather than shutting the water off.

The water is typically chemically treated and has been sitting there for years as well -- very nasty stuff.

sbierwagen · on April 27, 2023

Always worth tracking down the sprinkler shut off valve in your residence/place of work. If you’re in a high rise it‘ll be the big red wheel on the sprinkler main in the fire stairs. If it’s a spurious activation you can just shut it off yourself, you don’t need to ask anybody’s permission.

The fire department is always going to prioritize safety of life, and after all it’s not their stuff getting soaked.

Kon-Peki · on April 27, 2023

> The fire department is always going to prioritize safety of life, and after all it’s not their stuff getting soaked.

They won't hesitate to smash your stuff or break down your walls either.

Being in a fire is no joke. You've got to be crazy to think that your stuff is important. It's not.

rectang · on April 27, 2023

Fires develop crazy fast, too. Horrifying real-time footage of The Station nightclub fire is on YouTube. Within 2 minutes of ignition, fire is leaping out the windows. By 6 minutes in, the entire building is burning like a torch.

https://www.youtube.com/watch?v=rO0ioCCiEe8

Firefighters arrive in 5 minutes.

dsfyu404ed · on April 27, 2023

Anyone talking up firefighters like this clearly hasn't been around them much.

They're boys with toys that they don't frequently get to use and they work for the government. Follow the incentives. They'll do their jobs but they don't give a lot of fucks about things like "unnecessary property damage" and "other people's financial well being" and anything else not written in their KPIs.

I used to drive tow truck. I can't count the number of cars they totaled peeling the roof off (granted some were totaled anyway) because that was easy and cutting a door off was hard. And don't get me started on them and their stupid stands they use to prop shit up in the most questionable of ways...

local_crmdgeon · on April 27, 2023

Firefighters are personally incentivized to take photos like this: https://nypost.com/2022/01/10/fireman-in-post-photo-recalls-.... They are not incentivized to protect your stuff, that's what insurance does.

Good. I want you to save me and my family from dying in a fucking fire. This thread is obsessed with saving Funko Pop collections for some reason??

sbierwagen · on April 28, 2023

Hello 16 day old account. Can't imagine why you've been banned before. https://news.ycombinator.com/item?id=35714352

local_crmdgeon · on April 28, 2023

I make many accounts because I don't like people snooping through my history. The Redditism of "I've looked back 10 years in your history, and will dismiss your comment because of something you said in 2015" is not a discussion I'm interested in.

dsfyu404ed · on April 27, 2023

That's just a textbook appeal to emotion.

Most of their calls are mundane stuff. And they leave a pretty decently wide path of destruction in doing that. We're talking like mundane situations where there is no urgency and no need to tear shit up in the interest of time.

I once arrived to a minor rollover after the cops but before fire. Nobody injured. Occupant trapped because she was a large lady and couldn't release her seatbelt upside down and was having difficulty unlocking the car because side curtain airbags.

I offered to flip the car and treat it like a lockout. "Customer" was fine with it. Cop was iffy. FD showed up, didn't want to hear it, broke the window, unlocked the car, opened the door, cut her belt rather than release it and dropped her on her face and then had difficulty getting her out. Now I get that they have "procedures" but this seems like a forest for the trees situation.

Or they'll show up, shut down two lanes for a minor fire on the shoulder and not move the trucks until the car is loaded on a tow truck and gone. Supposedly it's to keep them safe from being hit by traffic. Meanwhile here I am not blocking traffic to recover shit that broke down.

Sure, they'll save a life if the situation presents itself but they sure don't care about being tidy about it.

riceart · on April 28, 2023

You have any real data to back this up rather than some anecdotes?

You sound really butthurt.

newZWhoDis · on April 27, 2023

True, and if you mess up how are they gonna know?

Your fingerprints won’t survive the fire!

local_crmdgeon · on April 27, 2023

Please do not disable or tamper with your buildings sprinkler systems.

You will not care about your stuff when you're in jail for negligent manslaughter.

deagle50 · on April 27, 2023

This happened in my freshman dorm as well. The broken sprinkler was on the 3rd or 4th floor and my room which was on the 1st got at least an inch of water.

mvanbaak · on April 27, 2023

datacenters dont use sprinkler systems (or at least they should not).

packetslave · on April 27, 2023

A non-water fire suppression system for a 300,000+ square feet warehouse-scale datacenter would be incredibly expensive.

Aeolun · on April 28, 2023

Would it be as expensive as replacing 300,000+ square feet of servers?

packetslave · on April 28, 2023

People who aren't me get paid to do that kind of math when designing datacenters. Presumably there's a break-even point.

jamesfinlayson · on April 28, 2023

I have no idea what's used these days but I remember someone telling me the dry powder that is sometimes used is very bad for computer hardware - not immediately, but within a couple of years the metal will show obvious signs of reaction with whatever is in it.

glogla · on April 27, 2023

Yeah, I always thought datacenters would use Halon. It of course has the problem of suffocating everyone.

packetslave · on April 27, 2023

Halon has been banned for years because 1) it's bad for the ozone layer and 2) it'll kill you. Newer systems (FM-200, Inergen, etc.) fight the fire by removing heat instead of removing oxygen.

alwayslikethis · on April 27, 2023

Halon is still used. Unfortunately the same properties that makes it effective also makes it harm the ozone layer. It does not just remove heat or oxygen, it directly interferes with the reaction involved in combustion, making things stop burning.

dx034 · on April 28, 2023

Wasn't there also this technique of lowering oxygen levels so much that humans can still survive but fire won't spread as fast? Or did this turn out to be too expensive?

palcu · on April 27, 2023

[disclaimer: SRE @ Google, I was involved with the incident, obvious conflicts of interest]

Hey Dang, thanks for cleaning up the thread. One thing to note is that the title is not correct. The entire region is not currently down, as the regional impact was mitigated as of 06:39 PDT, per the support dashboard (though I think it was earlier). The impact is currently zonal (europe-west9-a), so having zone in the title as opposed to region would reflect reality closer.

Finally, there's lots of good feedback on this thread and on the previous one (https://news.ycombinator.com/item?id=35711349), so we obviously have a lot of lessons to learn.

Waterluvian · on April 27, 2023

Would you be able to comment a bit on the emotional (perhaps there’s a better word) aspect of the response?

Was there a lot of anxiety? Panic? Or was it just a “woof that sucks. Time to follow a checklist and then do a bunch of paper work” ?

What I’m curious about is what it feels like on a team at a company like Google when there is a major system failure.

palcu · on April 27, 2023

There's not much emotion as the core team working on the huge outages is more like an "SRE for SRE". They are all people who've been with the company for a long time and they've been in the secondary seat for at least one previous big rodeo. Not to mention that we're all running a checklist that has been exercised multiple times and there's always somebody on the call who could help if a step fails.

Personally, I wasn't part this time for the actual mitigation of the overall Paris DC recovery, as I was busy with an unfortunate[0] side effect of the outage. These generate more anxiety, as being woken up at 6am and being told that nobody understands exactly why the system is acting this way is not great. But then again, we're trained for this situation and there are always at least several ways of fixing the issue.

Finally, it's worth repeating that incident management is just a part of the SRE job and after several years I've understood that it is not the most important one. The best SREs I know are not great when it comes to a huge incident. But, they're work has avoided the other 99 outages that could have appeared on the front page of Hacker News.

[0]: https://news.ycombinator.com/item?id=35734224

Waterluvian · on April 28, 2023

I appreciate your insight into this. Thanks!

throwbigdata · on April 28, 2023

Who trains the trainers?

palcu · on April 28, 2023

Life and experience, if you're looking for a short answer. For example, last year we had an outage in London[0] and the folks who worked on it learnt a lot. Now, they applied the learnings in this incident.

[0]: https://news.ycombinator.com/item?id=32161755

rickette · on April 27, 2023

Anyone know how this could affect multiple zones? "Customers can failover to zones in other regions". Unless a whole area got flooded.

gst · on April 27, 2023

There's an interesting Twitter thread about that topic here: https://twitter.com/GergelyOrosz/status/1651256082424012806

Based on that thread it sounds like only AWS guarantees that their AZs are in physically separate DCs, while for Google and Microsoft AZs could be in separate buildings of the same DC facility.

JCM9 · on April 27, 2023

Yes. Azure and GCPs numbers on the size of their AZs and such are more marketing spin than hard engineering. AWS keeps these in separate physical locations to provide true separation. While there have been tech related regional incidents at AWS a physical event disabling multiple AZs would be extremely unlikely given their much more robust and geographically distributed design. If such a physical event had happened in AWS it would have been a non-event with things just failing over to other AZs.

Other cloud providers mostly just vaguely put things in another part of the building and say it’s “a separate AZ” but as GCPs woes highlighted that’s corner cutting that bites badly when the whole building has a problem.

outworlder · on April 27, 2023

> If such a physical event had happened in AWS it would have been a non-event with things just failing over to other AZs.

In many cases in AWS an availability zone is actually composed of multiple datacenters, each with their own redundancies. This may not be true for smaller regions, but in large ones it definitely is. In those cases, losing an entire datacenter would maybe take out a percentage of instances in that AZ. This has happened before and our production systems barely noticed other than provisioning new nodes to replace the failed health checks.

kyrra · on April 27, 2023

Googler, opinions are my own.

I think you misunderstand Google's infrastructure. I'm guessing that each GCP zone is actually a Borg Cell (see: https://storage.googleapis.com/pub-tools-public-publication-... ). Borg cells tend to be isolated from eachother in many ways in the physical layer (networking and management being a big one, not sure about power). So networking or machine management for an entire zone could go down and not affect other cells. Changes also tend to get pushed on a per-cell basis when they are Google wide rollouts.

I believe GCP recommends to replicate data cross regions (https://cloud.google.com/architecture/framework/reliability/...).

Also see: https://cloud.google.com/architecture/disaster-recovery#regi...

stingraycharles · on April 27, 2023

I don’t know what you’re trying to say with Borg cells, the point of discussion is not that the network etc are separated, but that they’re physically separated in such a way that these kind of flooding wouldn’t affect different AZs, and that GCP is cutting corners here.

Obviously every cloud vendor recommends replicating data between multiple regions, but fact of the matter is that a lot of cloud services work much easier with redundancy within a single region than multi-region redundancy.

kyrra · on April 27, 2023

I guess it's different types of concerns. My feeling is that Google tries to optimize the resources of a datacenter, and the larger it is, the better things can scale. GCP Zones provide logical separation of machines for management (and network). There may be physical separation, but within a given region, GCP does not advertise this.

I think Google designs their datacenters for their own needs and expect you (a product running in their DCs) to distribute by region. Almost products at Google will be operating in multiple regions given the reach of most of our services, so DC design followed that need.

Based on GCP's docs, they still think region separate is better. Not sure why you wouldn't just do that?

If there is a catastrophic event (a large tornado hit AWS us-east-2), those buildings are pretty close to one another and both likely would be taken out, right? So you could lose multiple AZs since they are physically located so close to one another?

JCM9 · on April 27, 2023

Yeah, you’re not getting what people are saying. AWS’s AZs are much more separated than GCPs. Your recommendation that one could build across regions isn’t what folks are talking about here since there is a big benefit to having geographically separate AZs in the same region. That’s where GCP is falling short here.

mcast · on April 27, 2023

AWS treats its availability zones very seriously, each zone has its own independent power substation, air conditioning, and fiber lines.

It's incredibly rare for multiple AZs to go down at once, especially since they are more than a few miles apart from each other.

kevan · on April 27, 2023

Funnily enough floods (GCP) and fires (OVH) are two of the 3 things AWS explicitly mentions in the Well Architected docs. For a lot of companies an AZ going down is an annoyance or bad day but a whole region going down could be a real continuity risk.

> Each Availability Zone is separated by a meaningful physical distance from other zones to avoid correlated failure scenarios due to environmental hazards like fires, floods, and tornadoes.

https://docs.aws.amazon.com/wellarchitected/latest/reliabili...

jamesfinlayson · on April 28, 2023

> but a whole region going down could be a real continuity risk

Very much so - Australia only got a second region this year, so if your work required data to remain in Australia, you just had to hope that ap-southeast-2 didn't have a major issues. I'm sure there are plenty of other countries with only a single region.

sokoloff · on April 27, 2023

Unless they’re in us-east-1 and it’s an Amazon software/service fault.

local_crmdgeon · on April 27, 2023

This. Don't use us-east-1, it's by far the flakiest. PDX is also a bit rough, but Ohio is golden.

joelrwilliams1 · on April 28, 2023

Ohio has tons of problems, no one should ever put their infra in us-east-2 (shhhhhh...don't let the secret out )

simonebrunozzi · on April 27, 2023

And different flood planes. Source: I was at AWS 2008-2014.

rodgerd · on April 27, 2023

It makes it very easy for me (as someone who comes from a world of physical datacentres) to reason about what an AZ is getting me, and also to understand the benefits of using AWS (not having to think about the details of power routing, blade switch vs top-of-rack vs core switch, storage cabling, blah blah blah).

If I have to think too hard and do too much work about how I lay applications out, I might as well just rent in a colo.

fnordpiglet · on April 27, 2023

It’s more than that. AZ’s are geographically distinct as well along multiple dimensions include flood plains etc.

yegle · on April 27, 2023

For physical zone separation you need to check the `supportsPzs` attribute when listing the zones (e.g. https://cloud.google.com/compute/docs/reference/rest/v1/zone..., but you should be able to find many other places where this attribute is surfaced).

It says "reserved for future use" but other docs mentioned "physical zone separation": https://googleapis.dev/java/google-api-services-compute/alph...

outworlder · on April 27, 2023

Random datacenters should start advertising availability zones since they should have different fault domains anyway. Google can get away with this, why can't smaller companies?

xyst · on April 27, 2023

You would think that the company that literally wrote the book on “Site Reliability Engineering” would actually follow their own recommendations.

londons_explore · on April 27, 2023

Googles advice is not to rely on uptime in every region.

Instead aim for uptime in a few regions, and load balance your users to regions that are healthy.

That design is far cheaper for both google and for you - and, in the typical case, users still get nice low latency to a local datacenter, and only in the rare failure case might they have to wait for latency to some other region.

kevincox · on April 27, 2023

They do internally. But when customers want 3 zones in Indonesia they cut corners.

ddol · on April 27, 2023

Do Google host their own products on Google Cloud, or are there different sets of data centres for Search/Drive/Gmail vs Google Cloud Customers?

jeffbee · on April 27, 2023

This is a leased facility, the kind of place Google rents for cloud customers but doesn't need for itself. Google's own datacenters are https://www.google.com/about/datacenters/locations/

packetslave · on April 27, 2023

Other way around. Google Cloud runs on the same underlying datacenter, compute, and network infrastructure that Search/Drive/Gmail does.

[edit: at least in regions where Google HAS its own datacenters, e.g. "us-central-1? yes. europe-west9? maybe not"].

That does not imply that Search / Drive / Gmail runs on top of Google Cloud.

CydeWeys · on April 27, 2023

The recommendations are to run in multiple regions if you need this kind of redundancy. Run everything in a single region and you can be affected by an event like this.

migf · on April 27, 2023

It amazes me that in every market they serve, Amazon has no actual competitors from a feature perspective.

Like, Target does not compete with Amazon. They have a totally different home delivery model that is not in the same category of reliability or service.

It's annoying.

londons_explore · on April 27, 2023

I think it's because lots of amazons services are in 'winner takes all' markets.

No random online eshop can offer next day delivery across half the world unless they already have a logistics chain of 100,000 truck drivers spread across the world. But Amazon can.

Likewise, no cloud provider has enough data centers to offer multiple separate data centers in the same city, for hundreds of cities around the world. But Amazon does.

Any competitor can't offer amazons level of service until they get to amazon scale... Which they never will.

packetslave · on April 27, 2023

Or even in the same building, just with a different power/network domain.

rickette · on April 27, 2023

Ah I see, I know Azure and GCP in NL are in separate buildings but indeed on the same site. But that's not guaranteed for other regions, good to know.

outworlder · on April 27, 2023

I would really like to see the physical DC separation at "The Dalles, Oregon".

blacksmith_tb · on April 27, 2023

Looks like there are three buildings[1] to me, not entirely sure what goes where, obviously.

1: https://goo.gl/maps/Tfw5UpSsoYiN3YMVA

asymptotic · on April 27, 2023

When I worked at AWS there was a similar scenario in eu-west-2. There was a fire in one of the availability zones (AZs). The fire suppression system kicked in and flooded the data center up to ankle or knee height. All the racks were powered off and the building was evacuated for hours (I don't remember the duration of the evacuation) until the water was pumped out.

But for the service team I worked for, our AZ-evacuation story wasn't great at the time and it took us tens of minutes to manually move out of the AZ, but at least there wasn't a customer-visible availability impact. Once we did it was just monitoring and baby-sitting until we got the word to move back in, I think it was 1-2 days later.

If you operate on AWS you work with the assumption that an AZ is a failure domain, and can die at any time. Surprisingly many service teams at AWS still operate services that don't handle AZ failure that well (at the time). But if you operate services in the cloud you have to know what the failure domain is.

jamesfinlayson · on April 28, 2023

> urprisingly many service teams at AWS still operate services that don't handle AZ failure that well (at the time)

Ouch, hopefully none of the major services? I recently had to look into this for work (for disaster recovery preparation) and it seemed like ECS, Lambda, S3, DynamoDB and Aurora Serverless (and probably CloudWatch and IAM) all said they handled availability zone failures transparently enough.

asymptotic · on April 28, 2023

I’m familiar with Lambda and DynamoDB. When I left in 2022 they both had strong automated or semi-automated AZ evacuation stories.

I’m not that familiar with S3, but I never noticed any concerns with S3 during an AZ outage. I’m not at all familiar with Aurora Serverless or ECS.

For all AWS services you can always ask AWS Support pointed, specific questions about availability. They usually defer to the service team and they’ll give you their perspective.

Also keep in mind that AWS teams differentiate between the availability of their control and data planes. During an AZ outage you may struggle to create/delete resources until an AZ evacuation is completed internally, but already created resources should always meet the public SLA. That’s why especially for DR I recommend active-active or active-“pilot light”, have everything created in all AZs/regions and don’t need to create resources in your DR plan.

jamesfinlayson · on May 1, 2023

Okay good to know - Lambda seemed to suggest it could handle an availability zone going down without any trouble.

ECS Fargate's default is to distribute task instances across availability zones too but I'm assume if you use EC2 it might not be as straight-forward.

And that makes sense - I remember during the last outage that affected me it was a compute rather than data failure and the running stuff continued fine, just nothing new was getting created.

richardw · on April 27, 2023

I can imagine clients who used one DC being impacted. But Google’s services would be designed for a single DC going down, right? Data would be eventually consistent (once they find and plug the hard drives in) but isn’t this the promise of the cloud and they’re (approximately) the best at using it.

I have to assume it’s a fault that not even distributed services can paper over. Eg lots of crucial data in flight and they’re reluctant to drop it. Can an expert weigh in?

I love Google’s post-mortems. This one will be epic.

kccqzy · on April 27, 2023

> But Google’s services would be designed for a single DC going down, right?

Right. But nobody forces GCP's customers to design their services to be tolerant of a single DC failure. In fact as a business, actively not designing for such tolerance is an attractive cost-cutting measure.

outworlder · on April 27, 2023

Cloud customers have no control on which or how many 'datacenters' are used. That's not something that's even advertised or easily available to customers.

The logical units are regions and availability zones or the equivalent nomenclature in each cloud. One availability zone is expected to be one or more datacenters.

We have thousands of instances in AWS. I do not know - or care - where they are physically located(other than the region name, say, Oregon). I expect at most one availability zone to get impacted if a datacenter goes up in flames (and sometimes, just a portion of one). I mention in another comment that AWS has had issues before and production systems barely got impacted. And recovered with zero intervention - instances with failed health checks get replaced by brand new ones in whatever AZs are still operational.

> Data would be eventually consistent (once they find and plug the hard drives in)

At the level of abstractions cloud operates, no-one is plugging drives in – someone is, but you can never see it.

Most cloud workloads use network attached storage - when you can even see the logical drives (SaaS offerings may not even have that abstraction). We don't know (or care) how many physical hard drives exist, or where they are. Latency requirements probably dictate that they are close to the actual instances, but there's usually data replication going on even across DCs.

In addition to that, at least in AWS, if you have saved any volume snapshots at all, they will be in S3. This data will be replicated and underlying systems can even use it to restore lost or corrupted data without you even noticing and sometimes even without a recent snapshot, as storage keeps track of what blocks have been rewritten since the last snapshot. In a particularly bad case you might have to do a restore.

In almost a decade and number of volumes in the 6 digits (no clue how many drives that is!) we never had a single volume fail on AWS. Some got into a 'degraded' state and then recovered.

We haven't had any failures on GCP either. In the case of GCP, even faulty hypervisors are transparently worked around - we never notice other than some audit logs saying the VM was moved. They even preserve the network connections. AWS requires a stop/start to do the same, but your VM will be up and running in a different hypervisor (sometimes a different datacenter) in a couple of minutes, with all the storage.

Mind you, AWS promises eleven nines(!) of durability for S3.

When you do have locally attached storage, it's treated as ephemeral and it's gone if the instance restarts.

> I have to assume it’s a fault that not even distributed services can paper over.

If a single datacenter fails, since it _should_ be at most one AZ(this case seems to be different) that will depend on how the application is architected. Requests in flight will obviously fail, how big of a deal depends on the problem domain. For most web apps, this will cause a retry and that's the end of the story, others will be specifically engineered to deal with receiving multiple messages or dropping messages. For example, if you need at most once delivery guarantees, you need to take extra measures

Not all applications can survive an entire region going down. Some can, but that usually raises costs if you are continuously replicating data across regions. If you do that, then you should be able to steer traffic to the surviving regions. You can do that old-school by changing DNS records, or you could have fanciers solutions such as global anycast loadbalancers and have a single IP worldwide that still goes to the closest healthy region.

lamontcg · on April 27, 2023

Nobody here with any thoughts for the operations/datacenter engineers trying to deal with stopping and cleaning up the disaster, just customers complaining...

rurp · on April 27, 2023

"Thoughts and Prayers" type comments don't make for particularly interesting reading.

I think it's safe to assume that most people feel empathy for others struggling, whether or not they type it out regularly. Then again, some AI evangelists have had me questioning that assumption lately.

BlackjackCF · on April 28, 2023

I think people who are complaining are stressed out about their own services being down.

If you’ve only ever used the cloud, you’re not necessarily aware of everything that’s involved at data centers. If you’re not familiar with them, I don’t think you’d know how many things can (literally) blow up in your face. If someone sees flooding, they generally aren’t thinking that it’ll lead to fires.

Anyway, just want to think that everyone generally has good intentions and just don’t know what’s ACTUALLY happening in the DC, or how much work it will be for the folks working in the DC to restore services.

Hopefully all the failsafes kicked in and worked and nobody was injured.

okdood64 · on April 27, 2023

Nope. Big company bad. <Insert snarky overeactionary comment based on armchair knowledge>

Literally no concern here for anyone's safety or sanity in dealing with this.

effdee · on April 27, 2023

This post (in french) has some more details:

https://www.mail-archive.com/frnog@frnog.org/msg72320.html

Jgrubb · on April 27, 2023

eu-west-9 is Paris

sgt · on April 27, 2023

Apparently the servers were told they were expected to delay their retirement a bit.

antifa · on April 27, 2023

Thanks for posting which one, this is the most important detail and should have been in the title...

manojr13 · on April 27, 2023

Let's the servers cool down for sometime. Might have been working very hard.

fnordpiglet · on April 27, 2023

Last time they let Bard pick a data center location and design.

lukax · on April 27, 2023

Google now has a "data lake" in Paris.

nyc_data_geek1 · on April 27, 2023

This is not what I meant by digital ocean

walrus01 · on April 27, 2023

This is what happens when your cloud condenses in the water/vapor cycle and returns to liquid form temporarily.

krisoft · on April 27, 2023

I don’t see the problem. Clouds are just water droplets anyway.

Joking asside I hope we will get a nice postmortem with juicy civil engineering details.

IntelMiner · on April 27, 2023

Google's DC is underwater

OVH's caught fire

What's next, us-east-1 gets hit by Godzilla?

cgb223 · on April 27, 2023

Lol us-east-1 already went down for a day back in 2017 when an intern accidentally took down the whole DC. We could call him “Godzilla”

Source: my startup (stupidly) hosted our entire infra in us-east-1 at the time. Was a …tough day

throwawaaarrgh · on April 27, 2023

us-east-1 is a great place to test your application resiliency :) it's like they threw in chaosmonkey for free!

mjr00 · on April 27, 2023

It's funny because AWS, at least for the services I knew of when I was there, did rolling deploys to each region over several days. us-east-1 was always the final day because it was the biggest region, so you'd think it'd the safest region since everything getting deployed was well-tested. But while I was there I remember at least 2 COEs where the root cause was basically, "us-east-1 had some hacky legacy configuration that no other region has and that wasn't known/accounted for."

vicch · on April 27, 2023

sounds like an easy day to me, like when power goes out, nothing to do anyway, esp. when the entire infra is there.

firstSpeaker · on April 27, 2023

Some of the regions are so critical for AWS that them going down will bring down most of the control plane :P

glogla · on April 27, 2023

Water and fire already had their way, I suspect tornado and a landslide are next.

1123581321 · on April 27, 2023

Earthquakes and wind damage should be next.

syngrog66 · on April 27, 2023

if us-east-1 is not in Tokyo its safe

nixcraft · on April 27, 2023

I hope whoever is hosting data in that zone has thoroughly tested and verified backups offline or with another cloud provider. Of course, you can complete DC failover, depending upon service needs, but it costs more resources. Either way, timely tested backups are the only way to survive natural or manufactured disasters. Good luck to Google OPs team and everyone else involved with the GCP region in the EU.

Eji1700 · on April 27, 2023

I'm down for coding to go full circle.

We called them bugs because you literally had to go in and get the dead bugs out of your electrical system.

Now we can call it fishing because some pirate has sailed onto your datalake and is looking for sunken hashes.

What do you think the hourly for Cloud Architect/Data lake power boy level 1 should start at?

samstave · on April 27, 2023

Funny enough, Lucas Film suffered a same outage because their data center was backed up to the lake, and was basically under water, and then the wall started to leak.....

CyberDildonics · on April 28, 2023

Do you have a source for that?

samstave · on April 28, 2023

Yeah, me. I was one of the lead designers on the Lucas Presidio campus, including the DC there...

When we were still designing the site/systems/infra we would go to BigRock Ranch a lot for meetings and vendor interviews.

Its the datacenter at bigrock which has one wall which backups to the lake on site, and the DC is in the sub-level parking garage, and its rear wall was leaking due to the pressure from the lake on the wall.

CyberDildonics · on April 28, 2023

Was it really an 'outage' or did they just turn things off while they plugged the leak?

samstave · on April 29, 2023

Is your data center suffering an outage due to flooding from a faulty wall you built against the man-made lake you built next to your $ Billion data center??

Call California and Meyers to see if you qualify for PG&E benefits.

samstave · on April 29, 2023

You know what would be an interesting article would be to have a ton of outage/post=mortems etc.

So some of mine woule be

Chinese air usb gaps and phishing

What stories can you expound upon?

kortex · on April 27, 2023

A perfect example of when "cloud just means someone else's computers". It's literally a leaky abstraction.

doubled112 · on April 27, 2023

Sure is leaky. And a cloud is a bunch of water vapour that eventually comes crashing down to earth. I'll never understand how we decided it was a good metaphor for a place we run our services.

Startlingly accurate in this case.

dijit · on April 27, 2023

“the cloud” comes from old network diagrams that used cloud to mean “internet” or “unknown network”.

I think “unknown network” definitely accurately captures what hyperscalers are selling. :)

88913527 · on April 27, 2023

After spending 5 minutes engaging with Product Managers, I am not at all surprised they landed on calling it 'the Cloud'.

paulmd · on April 27, 2023

https://www.youtube.com/watch?v=AnxrJiS5uKU

benatkin · on April 27, 2023

1) Pay for stuff

2) Not be able to use it

3) Company continues to pretend this doesn't happen on the regular

worldsavior · on April 27, 2023

> Company continues to pretend this doesn't happen on the regular

What do you want them to say? "Hey we have X breakdowns but please, pay!"

p1necone · on April 27, 2023

It's more that everyone advertises so many 9's of uptime, but in reality they don't count anything less than a total outage of an entire datacenter as actual downtime against that statistic.

benatkin · on April 27, 2023

Admit that it's not really serverless or cloud computing but someone else's computers.

Or just cut down on nonsense in some way.

If they can figure out how to make nonsense, they can figure out how to dial it back.

zztop44 · on April 27, 2023

I don’t see how this is a good argument? Yes, it’s someone else’s computers. That they rent me by the month or the hour or the second through a programmatic interface. That’s exactly the product I want. And it’s not surprising that it has outages sometimes because at the end of the day, yeah, it’s still just a bunch of computers someone has put in a room somewhere.

Eji1700 · on April 27, 2023

Yeah this is the usual backlash of experts vs marketing vs management.

Someone else managing your shit so you don't have to is a market in just about every industry, and it makes a ton of sense in tech where things don't even have to be on the same continent to work (or very specifically NEED to be on another one if you're international).

There's a ton of companies that have jumped to the cloud that probably shouldn't have, and even more who should've jumped, but not nearly as much as they did. Still it's a useful service.

Now of course it being a useful service, that also happens to be so obscenely expensive to start up barely anyone does it, means it comes with all the miserable obfuscation, bullshit fine print, total lack of support, and every other horrible thing we've come to expect from the modern world, but "It's just someone else's computer!" isn't changing any minds.

Either they already knew that or they never cared.

benatkin · on April 27, 2023

No, "serverless" is a toxic term. I get that it exists and I can't do anything about it except refuse to use it. It isn't something I misunderstand because I'm not a marketing person though. It is a term that I understand and disagree with.

Cloud Computing I agree is a usual concept that marketing and experts often disagree about.

throwaway2729 · on April 27, 2023

google cloud poised for precipitous fall

kotaKat · on April 27, 2023

Ah, a rainy day in the cloud.

kotaKat · on April 28, 2023

"Please don't fulminate. Please don't sneer, including at the rest of the community." @dang, since this got hidden and I can't reply to the toppost.

We're allowed a little humor, damnit.

redindian75 · on April 27, 2023

This is the problem storing data in the cloud - whenever it rains you may have a big data problem.

burnt_toast · on April 27, 2023

It's okay because once the water evaporates its backup in the cloud.

timack · on April 27, 2023

Really? Are you cirrus?

CobrastanJorji · on April 27, 2023

A series of tubes would have helped with this.

aruggirello · on April 27, 2023

Unfortunately, it appears Google Plumber was discontinued by Alphabet Inc.

iJohnDoe · on April 28, 2023

The plumber was laid off.

geocrasher · on April 27, 2023

Thank you for your input, Senator Stevens.

moffkalast · on April 27, 2023

Or a big truck to bring some tarps

throwawaaarrgh · on April 27, 2023

I know I said our pipeline abstraction was leaky but this is ridiculous

1970-01-01 · on April 27, 2023

[flagged]

sgt · on April 27, 2023

And ironed

tpmx · on April 27, 2023

[flagged]

outworlder · on April 27, 2023

The headline is not incorrect.

"Multiple Google Cloud services in the europe-west9 region are impacted.

Description: Water intrusion in europe-west9-a has caused a multi-cluster failure and has led to an emergency shutdown of multiple zones. We expect general unavailability of the europe-west9 region. There is no current ETA for recovery of operations in the europe-west9 region at this time, but it is expected to be an extended outage"

Emergency shutdown of multiple zones.

As of a few hours ago they changed the status to report just on us-west9-a.

tpmx · on April 27, 2023

> As of a few hours ago they changed the status to report just on us-west9-a.

The headline is 1h20m old and says a region is currently down.

outworlder · on April 27, 2023

Even the article linked mentions that Google initially reported that only zone A was affected, then got changed to report that the whole region was affected, now it changed to a single zone again.

Do you expect realtime updates whenever Google changes the story?

CydeWeys · on April 27, 2023

> Due to software bugs other zones in the same region were also previously down.

I don't think it was a software bug. I think they were taken down as a precautionary measure, to not risk flood damage or causing additional fires.

packetslave · on April 27, 2023

and you're basing that on... what, exactly?

faangiq · on April 28, 2023

But I thought Google only hired geniuses?

iJohnDoe · on April 28, 2023

Yes, and we often hear genius insights and anecdotes from their SRE employees who have their blog links posted to HN.