In our previous tests, when it was 1.5 Pro against GPT-4o and Claude 3.7 Sonnet, Gemini wasn't winning the multilingual race, but it was definitely competitive. 2.5 and 3.0 seem to be big leaps from the 1.5 days.
That said, it also depends on the testing methodology; we tested a bunch of use cases aimed mostly at core linguistic proficiency, not so much at complex in-language tasks or cultural knowledge.
Which languages, how popular, how many? The biggest difference has been for low-resource or far-from-English languages: Thai, Korean, Vietnamese, and so on. For something like German or French, all of them were of course good enough that general intelligence and other factors overruled any language differences. I didn't take screenshots, maybe archive.org has them, but during the entire period of that generation of models on the LMArena leaderboard, there was a large gap between 1.5 Pro's rankings on such languages and its ranking on English. That gap was backed up by our own experience, including feedback from groups of native speakers.