
This seems to significantly under-report what's going on, see:

https://www.theregister.com/2023/04/26/google_cloud_outage/

There is mention of a fire as well.



Oh, the irony.

A few years ago I implemented a top to bottom ISO27k1 ISMS for a client handling extremely sensitive and mission-critical data for industry.

One risk I recommended controls for was that of a fire and/or flood at their primary datacentre for their client-facing offerings - this datacentre. I’ve experienced the misery of a datacentre oops myself, firsthand, twice, and it’s a genuine risk that has to be mitigated.

At my insistence, I had them burn hundreds of man-hours ensuring that they could fail over to a new environment in a different datacentre with a bare minimum of fuss, as what I arrived at was an all-the-eggs-in-one-basket situation. It took a fair bit of re-engineering of how deployments worked, how data was replicated, and how the environment was configured - but they got there, and the ISMS was put into operation, and was audited cleanly by a reputable auditor, and everyone lived happily ever after.

Except… they were acquired by private equity. Who had no truck with all of this costly prancing about with consultants and systems. Risk register? Why do we need this? What value does it add today? ISO27k1? Don’t be silly. We have that certificate. You don’t need it. Dev team, ops team, leadership — almost everyone — ejected and replaced with a few support staff.

I see their sites are down.


There's that beautiful German word again... schadenfreude. I have had similar discussions multiple times in the last year, and the magical thinking around the cloud is so strong that it is sometimes impossible to get through. The fact that cloud stuff can go down, and that in the end it is your data and no amount of cloud credits are going to help you if it is lost, seems to be utterly beyond some people's comprehension.


In this case it wasn’t so much “the cloud is invincible!” as “this appears to be a cost centre” - their whole shtick is to boost profitability in the short term by gutting businesses, and then selling them onwards to some other finance sucker.

I am sure that my name is currently being damned in a boardroom in Chicago, since, as the person who warned of this, I am likely seen as responsible.


> I am sure that my name is currently being damned in a boardroom in Chicago, since, as the person who warned of this, I am likely seen as responsible.

In that sense nothing ever changes. Anyway, kudos to you for staying the course on topics like these.


This really has nothing to do with cloud and is more of an "all eggs in one basket" problem. I wish people would stop painting cloud itself as less capable.

The fact is, most cloud providers offer multiple regions, which can give you more geographic redundancy than most companies operating their own datacenters have.

Whether you choose to adopt a multi-region or multi-datacenter architecture is really orthogonal to whether you choose cloud or on-prem.
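
To make the orthogonality concrete, here's a minimal sketch (Python, with hypothetical example.com endpoints): the failover logic is identical whether the two targets are cloud regions or your own datacentres.

    import urllib.error
    import urllib.request

    # Hypothetical endpoints, purely for illustration; they could just as
    # easily be two cloud regions or two self-hosted datacentres.
    REGION_ENDPOINTS = [
        "https://eu-west.example.com",
        "https://us-east.example.com",
    ]

    def fetch_with_failover(path, timeout=5.0):
        """Try each region in order and return the first successful response body."""
        last_err = None
        for base in REGION_ENDPOINTS:
            try:
                with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                    return resp.read()
            except OSError as err:  # URLError, timeouts, connection resets, ...
                last_err = err      # this region is unreachable; try the next one
        raise RuntimeError("all regions unreachable") from last_err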


Plot twist: the server racks were made out of sodium.


You're not far off: the batteries are (probably) made of lithium.

Also, why batteries in a datacenter? When you implement a flush() command at the lowest level you're faced with two choices: 1) actually write to disk, then return from the call, 2) write to some cache/RAM and have just enough battery locally to ensure that you can write it to disk even if all power goes out.

Then there's the other problem of surviving long enough between a power interruption and the diesel generators starting up. But this is a smaller problem: rebooting all instances in a datacenter is less bad than losing some data that was correctly flush()ed by software. Bad flush() behaviour can result in errors that cannot be recovered from without complicated manual intervention (for example, if it causes corrupted and unreadable database files).
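
For what it's worth, option 1 looks roughly like this at the application level (a minimal Python sketch assuming POSIX fsync semantics; option 2 is what a battery-backed write cache implements below this layer):

    import os

    def durable_write(path, data):
        # Option 1 from above: only return once the data is on stable storage,
        # not merely sitting in the OS page cache or a volatile drive cache.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)  # ask the kernel (and, transitively, the drive) to flush
        finally:
            os.close(fd)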


The batteries in the datacenter are simply there to hold the power until the generators are all up and running, and the phases are in sync.

They create 3 separate arrays of batteries in each rack. Each array represents a power phase, A-B-C. If I remember correctly, each array has a number of low-voltage/2000 amp batteries connected in series to make up a 2000 amp, 480 volt leg on the other end.

In a Tier 4 plus-1 datacenter, they have 4 battery rooms and 4 generators for each data pod. You have a primary generator and UPS battery set, and a backup generator set for each pod. And then that generator set has its own primary and secondary backup set. The end result is that they can work on any piece of equipment without interrupting power. In the event they lose the primary set or need to take it offline for maintenance, they have the whole secondary redundant set to fall back on.

The servers on the receiving end of the power cord, after it passes the switchgear, never know that there have been power source changes on the other end.


I wish I understood why my apple autocorrect stopped working lol


> Also, why batteries in a datacenter?

Everything serious in the telecom/ISP infrastructure sector has a big -48VDC battery plant, or preferably separate A and B side -48VDC battery plants, to provide a significant buffer between power going Grid --> AC-to-DC Rectifiers --> Equipment, and when a generator can start up, warm up, and transfer switch does its job.

Even if a bunch of servers don't have any UPS or battery backup because they're designed to tolerate individual node (or whole rack, or even whole row) failures, the core network equipment in a datacenter will still have a huge battery plant.

Ideally, if you have a chilled water loop for cooling, you do not want it anywhere near your big-ass racks of batteries. Or near the racks that contain the rectifiers, DC breakers, and distribution bus bars.

If you look at the battery racks in a traditional telco CO in the US for instance you will see that all of the cabling and batteries are a minimum of 1 foot off the floor, so that the whole place could theoretically flood and the DC distribution would remain unaffected. Same principle that applies to very traditional setups with wet-cell 2V lead acid batteries also applies to more modern things if building from scratch.


Very different trade-offs in play for Google, who run with a relatively high tolerance for failure at the individual machine or even rack level. At one point I believe there were batteries in every rack, though I don't know what they're building these days. A telco DC is gonna have more network interconnect with a lower tolerance for failure, due to capacity impact that isn't easy to double.

Think like a fiber termination demarc vs an in-cluster mesh.


Google Cloud cannot run with a high tolerance for failures. Google the product wouldn't notice a region or zone going down; Google Cloud customers will.


What I was saying above is that the 'core' of a Google DC has a massive amount of network interconnect, and needs for battery backup not very different from a big IX point or a traditional "primary CO" for a city in a telco environment.

By square footage, maybe 95% of a Google DC might have no UPS or battery backup, but the core network, things like routers and DWDM transport equipment, absolutely will.

If they were unlucky enough that the burst cooling loop met the battery plant for the core gear in a building or small campus of buildings...


NaCl, the revolutionary Sodium Cloud technology.


The outage has been going on for 40+ hours now...

I think this is sort of big.


This doesn't sound as bad as OVH's 2021 fire.


Well, we had pictures of the OVH fire very quickly. Google seems to be not very transparent about what exactly is happening...


The linked article says that there was a leak in a water cooling system, which in turn ended up in the battery system, which caused a fire. But yeah, it's not coming from Google but from second-hand reports.


It's not second-hand, it's from the colo provider they're using in Paris.


If you drench a fire in water in a DC, it might be just as bad.


I wouldn't draw any conclusions just yet.



