Yeah, this is just completely wrong. Without getting into specific products, modern systems are at around human-level transcription accuracy on public test sets from the 1990s like Switchboard and WSJ; 20 years ago the state of the art was nowhere near that.
It's also not objectively a simpler problem. Humans are actually not particularly good at speech recognition, especially when talking to strangers and when they can't ask the speaker to repeat themselves. Consider how often you need subtitles to understand an unfamiliar accent, or reach for a lyrics sheet to understand what's being sung in a song. For certain tasks ASR may be approaching the noise floor of human speech as a communication channel.
Humans may not be particularly great at speech transcription, but they're phenomenal at speech recognition, because they fill in any gaps in the transcription from context and memory. At 95% word accuracy, you're talking about a dozen errors per printed page (a typical page runs around 250 words, and 5% of that is roughly a dozen). Any secretary who made that many errors taking dictation, or any court reporter who made that many errors transcribing a trial, would quickly be fired. In reality, you'd be hard pressed to find one obvious error in dozens of pages of a transcript prepared by an experienced court reporter. It's not uncommon in the legal field to receive a "proof" of a deposition transcript running hundreds of pages and find only a handful of substantive errors that need correcting. That is to say, whether or not the result is word-for-word what was said, it's semantically indistinguishable from what was actually said. (And that is why WER is a garbage metric: what matters is semantically meaningful errors.)
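To make the WER point concrete: WER is just word-level edit distance (substitutions + insertions + deletions) divided by the length of the reference transcript, so every error counts the same no matter how badly it mangles the meaning. A toy sketch of my own, with a made-up example sentence (not lifted from any ASR toolkit):

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level Levenshtein distance / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i  # hypothesis empty: i deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j  # reference empty: j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # substitution
        return dp[len(ref)][len(hyp)] / len(ref)

    reference = "the witness did not sign the form"
    print(wer(reference, "the witness did not sign a form"))   # 1/7: "the" -> "a"
    print(wer(reference, "the witness did now sign the form")) # 1/7: "not" -> "now"

Both hypotheses score an identical WER of 1/7, but only the second flips the meaning of the sentence; WER can't tell them apart.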
The proof of the pudding is in the eating. If automatic speech recognition really worked, people would use it. (After all, executives used to dictate letters to secretaries back in the day.) On the rare occasions you do see people dictate something into Siri or an Android phone, more often than not the results are hilariously wrong.
That article is correct that WER has some problems, but it also correctly concludes that "Even WER’s critics begrudgingly admit its supremacy."
Yes, Switchboard has problems (I've mentioned many of them here), but it was something that 1990s systems could be tuned for. You see even more dramatic improvements when using newer test sets: a speech recognition system from the 1990s will fall flat on its face when transcribing (say) a set of modern YouTube searches. Most systems in those days also didn't make a real attempt at speaker independence, and tuning to a single speaker makes the problem vastly easier.
Executives don't dictate as much any more because most of them learned to touch type.