Get there by what mechanism? In the near term a good model pretty much requires a GPU, and it needs a lot of VRAM on that GPU. And the current state of the art in quantization has already captured most of the RAM savings it possibly could (rough numbers at the end of this comment).
And it doesn't look like the average computer with Steam installed is going to get above 8GB of VRAM for a long time, let alone the average computer in general. Even focusing on new computers, it doesn't look that promising.
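Back-of-envelope on the quantization point (counting weights only, ignoring KV cache and activations):

```python
# Back-of-envelope: weight memory for an LLM at different quantization levels.
# Assumes memory use is dominated by the weights themselves.

def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: params * bits / 8 bytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for bits in (16, 8, 4, 2):
    print(f"70B model at {bits:>2}-bit: ~{weight_gib(70, bits):6.1f} GiB")

# 70B at 16-bit: ~130.4 GiB
# 70B at  8-bit: ~ 65.2 GiB
# 70B at  4-bit: ~ 32.6 GiB   <- where decent quants sit today
# 70B at  2-bit: ~ 16.3 GiB   <- only 2x headroom left, with steep quality loss
```

FP16 to 4-bit was a 4x cut and is already standard practice; the only remaining headroom before you run out of bits is about 2x more, and quality falls off hard below 4-bit.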
By M-series and AMD Strix Halo. You don't actually need a GPU: if the manufacturer knows the use case will be running transformer models, a more specialized NPU coupled with the higher memory bandwidth of on-package RAM can do the job.
This will not result in SOTA-sized models running locally, but it could result in some percentage of people running 100B-200B models, which are large enough to do some useful things.
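To put numbers on that tier (my assumptions: 4-bit weights, dense models, weights dominate memory; the bandwidth figure is illustrative):

```python
# Sketch: memory footprint of the 100B-200B tier at 4-bit, plus a crude
# decode-speed ceiling. Decoding one token on a dense model streams every
# weight from RAM, so tokens/s <= bandwidth / bytes-per-token.

def weight_gib(params_billions: float, bits: float = 4) -> float:
    return params_billions * 1e9 * bits / 8 / 2**30

def decode_tps_ceiling(params_billions: float, bw_gb_s: float, bits: float = 4) -> float:
    bytes_per_token = params_billions * 1e9 * bits / 8  # dense: all weights read
    return bw_gb_s * 1e9 / bytes_per_token

for params_b in (100, 200):
    print(f"{params_b}B @ 4-bit: ~{weight_gib(params_b):.0f} GiB weights, "
          f"<= ~{decode_tps_ceiling(params_b, bw_gb_s=256):.1f} tok/s at 256 GB/s")

# 100B @ 4-bit: ~47 GiB weights, <= ~5.1 tok/s at 256 GB/s
# 200B @ 4-bit: ~93 GiB weights, <= ~2.6 tok/s at 256 GB/s
```

So a 128GB unified-memory machine fits the tier, and since an MoE model only streams its active experts per token, sparse 100B+ models are probably the practical sweet spot rather than dense ones.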
Those also contain powerful GPUs. Maybe I oversimplified, but I did consider them.
More importantly, it costs a lot of money to get that kind of bus width before you even add the memory. There is no way things like the M Pro and Strix Halo take over the mainstream in the next few years.
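For concreteness on what bus width buys (data rates here are typical LPDDR5X numbers, used as assumptions):

```python
# Peak memory bandwidth ~= (bus_width_bits / 8) bytes/transfer * MT/s, in GB/s.
# Wider buses need more memory channels, package pins, and board routing,
# which is where the cost comes from before you pay for a single DRAM die.

def peak_bw_gb_s(bus_width_bits: int, mt_per_s: int) -> float:
    return bus_width_bits / 8 * mt_per_s / 1000

print(peak_bw_gb_s(128, 8000))  # ~128 GB/s: mainstream laptop LPDDR5X
print(peak_bw_gb_s(256, 8000))  # ~256 GB/s: Strix Halo-class
print(peak_bw_gb_s(512, 8533))  # ~546 GB/s: M4 Max-class (assumed figures)
```

Mainstream parts sit at 128-bit because each doubling of width roughly doubles the pin count and routing cost, which is exactly why this stays a premium feature.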