HW noob here — does anyone have insight into how an issue like this passes EM simulation during development? I understand that modern chips are way too complex for full formal verification, but I'd have thought memory modules are so structurally regular that it might be possible there despite that.
I am no expert in the field, but my reading of the original rowhammer issue (and later partial hardware mitigations) was that it was seen as better to design RAM that works fast and is dense and get that to market, than to engineer something provably untamperable with greater tolerances / die size / latency.
GPUs have always been squarely in the "get stuff to consumers ASAP" camp, rather than NASA-like engineering that can withstand cosmic rays and such.
I also presume an EM simulation would be able to spot it, but prior to rowhammer it's also possible no-one ever thought to check for it (or, more likely, that they'd run the simulation with random or typical data inputs, not a hitherto-unthought-of attack vector; that doesn't explain more modern hardware, though).
I seem to recall that rowhammer was known, but thought impossible for userland code to exploit.
This is a huge theme in vulnerabilities. I almost said "modern", but looking back I've seen the cycle (disregard attacks as strictly hypothetical; get caught unprepared when somebody publishes something making them practical) happen more than a few times.
someone did a JavaScript rowhammer in 2015 (Rowhammer.js); hardware that's vulnerable today is just manufacturers and customers deciding they don't want to pay for mitigation
(personally I think all RAM in all devices should be ECC)
We don't want "mitigation", we want true correctness: or at least the level of perfection achievable before manufacturers thought they could operate with negative data-integrity margins and convinced others that it was fine. (One popular memory-testing utility made RH tests optional and hidden by default, on the reasoning that "too many DIMMs would fail"!) DRAM generations before DDR2 and early DDR3 didn't have this problem.
RAM that doesn't behave like RAM is not RAM. It's defective. ECC is merely an attempt at fixing something that shouldn't've made it to the market in the first place. AFAIK there is a RH variant that manages to flip bits undetectably even with ECC RAM.
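On the ECC point: server ECC is SECDED (single-error-correct, double-error-detect), typically a (72,64) code. Below is a toy (8,4) sketch — a Hamming(7,4) code plus an overall parity bit, a deliberate simplification and not the real DDR code — that shows the failure mode an ECC-bypassing rowhammer variant exploits: three flips in one codeword can masquerade as a correctable single-bit error and get "fixed" into different, wrong data.

```python
# Toy SECDED: Hamming(7,4) + overall parity bit (8,4).
# 1 flipped bit -> corrected; 2 -> detected but uncorrectable;
# 3+ can be silently "corrected" into a *different* valid codeword.

def encode(nibble):
    d = [(nibble >> i) & 1 for i in range(4)]  # d0..d3, LSB first
    p1 = d[0] ^ d[1] ^ d[3]                    # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                    # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                    # covers positions 4,5,6,7
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]  # codeword positions 1..7
    overall = 0
    for b in code:
        overall ^= b
    return code + [overall]                    # overall parity in slot 8

def decode(code):
    c = code[:7]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)      # 1-based error position
    parity = 0
    for b in code:
        parity ^= b
    if syndrome and parity:        # looks like a single-bit error...
        c[syndrome - 1] ^= 1       # ...so "correct" it (maybe wrongly!)
        status = "corrected"
    elif syndrome and not parity:  # two flips: detected, not fixable
        status = "detected"
    elif not syndrome and parity:  # flip was in the parity bit itself
        status = "corrected"
    else:
        status = "ok"
    data = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return data, status

word = 0b1011
bad = encode(word)
for i in (0, 1, 3):  # three flips: syndrome now points at position 7
    bad[i] ^= 1
print(decode(bad))   # wrong data, yet reported as "corrected"
```

With enough simultaneous flips in one codeword, the decoder happily reports "corrected" while handing back garbage — which is the gist of how rowhammer can beat ECC, just at (72,64) scale.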
> manufacturers and customers deciding they don't want to pay
It's more of a tragedy-of-the-commons problem. Consumers don't know what they don't know and manufacturers need to be competitive with respect to each other. Without some kind of oversight (industry standards bodies or government regulation), or a level of shaming that breaks through to consumers (or e.g. class action lawsuits that impact manufacturers), no individual has any incentive to change.
Shame is an underrated way of pushing for better standards. The problem is getting people in the know and having them vote with their wallets, or at least with public sentiment (social media pressure).
The manufacturers tried to sweep it under the rug when the first RowHammer came out. One of the memory testing utilities added tests for it, and then disabled those because they would cause too many failures.
I'm coming back to note: die shrinks, density increases, and frequency increases, all while keeping costs under control, work together to make rowhammer inevitable. I maintain they knew about it, dismissed it as impractical, tested whether it was a concern in normal usage... and were caught flat-footed when a PoC hit the public.
I'm not versed enough in silicon fabrication to know whether there are ameliorations beyond what hit the press nearly 20 years ago now. But while deep-diving modern DRAM for an idea, it's shocking how small a change is needed to corrupt a bit in DRAM.
The only means might be cultural. Security conferences such as DefCon or Black Hat could create a list of insecure technology that is ubiquitous yet ignored by product designers and OEMs, then vote on ranking its priority and when it should be removed.
News would latch on to headlines like "Hackers say all computers without ECC RAM are vulnerable and should not be purchased because of their insecurity. Manufacturers like Dell, Asus, Acer, ... are selling products that help hackers steal your information." "DefCon hackers thank Nvidia for making their jobs easier ..."
Such statements would be refreshed during or after each security conference. With over a dozen conferences a year, roughly once a month these would be brought back before the public as a reminder. The public might stop purchasing from those manufacturers, or choose the secure products, and create the change.
> but prior to rowhammer it is also possible no-one ever thought to check for it
It was known as "pattern sensitivity" in the industry for decades, basically ever since the beginning, and considered a blocking defect. Here's a random article from 1989 (don't know why first page is missing, but look at the references): http://web.eecs.umich.edu/~mazum/PAPERS-MAZUM/patternsensiti...
...and essentially said "who cares, let someone else be responsible for the imperfections while we can sell more crap", leading to the current mess we're in.
The flash memory industry took a similar dark turn decades ago.
Given that I wasn't surprised by the headline, I have to imagine that Nvidia engineers were also well aware.
Nothing is perfect, everything has its failure conditions. The question is where do you choose to place the bar? Do you want your component to work at 60, 80, or 100C? Do you want it to work in high radiation environments? Do you want it to withstand pathological access patterns?
So in other words, there isn't a sufficient market for GPUs that cost double the $/GB for RAM but are resilient to rowhammer attacks to justify manufacturing them.
The idea of pathological RAM access patterns is as ridiculous as the idea of pathological division of floating point numbers. ( https://en.wikipedia.org/wiki/Pentium_FDIV_bug ). The spec of RAM is to be able to store anything in any order, reliably. They failed the spec.
Rowhammer is an inherent problem to the way we design DRAM. It is a known problem to memory manufacturers that is very hard, if not impossible, to fix. In fact, Rowhammer only becomes worse as the memory density increases.
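To put rough numbers on why density makes it worse (figures approximate, taken from published rowhammer studies — timings vary by part and generation): the attacker's budget is how many row activations fit in one refresh window, while the flip threshold keeps falling as cells shrink and hold less charge.

```python
# Back-of-envelope, hedged: approximate DRAM timing figures.
tREFW = 64e-3    # refresh window: every row is refreshed within ~64 ms
tRC   = 46.5e-9  # row cycle time (back-to-back ACTs to a bank), ~46.5 ns

budget = int(tREFW / tRC)  # activations an attacker can issue per window
print(f"activations per refresh window: ~{budget:,}")  # ~1.37 million

# Published flip thresholds, roughly: ~139K activations on 2014-era DDR3
# vs under ~10K on some newer LPDDR4 parts. The attacker's headroom
# within a single refresh window has grown by an order of magnitude.
for name, threshold in [("DDR3 (2014)", 139_000), ("LPDDR4 (2020)", 10_000)]:
    print(f"{name}: budget/threshold ~ {budget // threshold}x headroom")
```

The budget side barely moves (refresh windows and activation timings are fairly fixed), so every density generation shrinks only the threshold side of the ratio — which is why the problem gets worse, not better, over time.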
It’s a matter of percentages… not all manufacturers fell to the rowhammer attack.
The positive part of the original rowhammer report was that it gave us a new tool to validate memory (it caused failures much faster than other validation methods).