He's a former Spotify employee now, but he was still at Spotify when he made it. I think it hasn't been updated since he lost his data access.
I have a lot of respect for Glenn McDonald for fighting spam on Spotify all these years, but we can do better than PCA for mapping music these days. Any neural embedding model is going to produce more meaningful axes. In fact Spotify had an intern who did just that, just before the launch of Discover Weekly: Sander Dieleman. Along with Aäron van den Oord, he was snapped up by DeepMind after his Spotify internship. Those two guys were (and are) wildly good at what they do.
Ad supported streams in Spotify are counted in a separate pool, and only get paid out of the ad revenue pool.
Artists can of course complain that "they're selling our music for cheap!", especially in the ad pool. But it's worth remembering that when it comes to setting optimal price points, Spotify's interest is almost perfectly aligned with the artists'. And Spotify has a hell of a lot more data than artists do (not to mention financial sense; if you had a lot of that, you probably wouldn't have become an artist).
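The two-pool model described above is a pro-rata split: each pool's revenue is divided among artists in proportion to their share of that pool's streams. A minimal sketch, with entirely hypothetical numbers (Spotify does not publish these):

```python
def pool_payouts(pool_revenue, streams_by_artist):
    """Pro-rata split: each artist's payout is the pool's revenue
    times their share of the pool's total streams."""
    total_streams = sum(streams_by_artist.values())
    return {
        artist: pool_revenue * streams / total_streams
        for artist, streams in streams_by_artist.items()
    }

# Hypothetical figures; the ad pool typically pays far less per stream.
premium = pool_payouts(1000.0, {"artist_a": 90_000, "artist_b": 10_000})
ad = pool_payouts(100.0, {"artist_a": 10_000, "artist_b": 40_000})

print(premium)  # artist_a earns 900.0 of the premium pool
print(ad)       # artist_b earns 80.0 of the ad pool
```

Note how an artist overrepresented in the ad pool (artist_b here) earns much less per stream overall, which is exactly why the per-pool rates matter.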
> Ad supported streams in Spotify are counted in a separate pool, and only get paid out of the ad revenue pool.
What are the rough rates for each pool? That's the important part here. And how many artists are far enough from the average ratio that the two-pool detail actually matters?
I'd be interested in knowing that too; as far as I know, Spotify doesn't publish those details to the public.
But I have no trouble believing some artists are vastly overrepresented in the ad-financed pool. Also, there are separate pools per country, and countries have different subscription prices: being big in Japan will be more profitable than being big in India.
Payout per stream is a terrible metric. It's almost like if you ranked grocery stores by payment per gram.
> Payout per stream is a terrible metric. It's almost like if you ranked grocery stores by payment per gram.
CDs are usually similar prices. Per-stream isn't nearly as bad as wildly different products sharing prices.
We could debate per stream versus per minute but I don't know if that's a particularly big effect. It causes some annoyance but it's mostly compensated for already.
Anything that gives different value to different artists is probably going to favor the big ones and just make things worse.
CDs get wildly different numbers of plays. But the number of plays, whether from a record or from a streaming service, isn't proportional to how glad you are that the music exists and that you can listen to it.
The present system favors big artist rights owners a lot, but most of all it rewards owners of music played on repeat, i.e. background music.
Self-supplied metadata in music catalogs is notoriously shit. The degree to which most rights owners don't give a damn is telling.
Spotify's own metadata is not particularly sophisticated: "Valence", "Energy", "Danceability", etc. You can see from a mile away that these are names assigned to PCA axes, and they correspond pretty poorly to actual musical concepts, because whatever they analyzed isn't nicely linearly separable.
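The pattern the comment describes can be sketched in a few lines: run PCA over some raw audio features, then attach human-chosen names to the top components. The feature matrix and the names here are entirely synthetic; nothing ties an axis to a real musical concept.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy feature matrix: 200 tracks x 8 raw audio features
# (tempo, loudness, spectral stats, ...). All synthetic.
X = rng.normal(size=(200, 8))

# PCA via SVD on the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]          # top two principal axes
scores = Xc @ components.T   # each track's coordinates on those axes

# The "naming" step is purely a human judgment call: nothing
# guarantees the first axis is "energy" or the second "valence".
named = {"energy": scores[:, 0], "valence": scores[:, 1]}
```

The weakness is exactly the one named above: PCA only finds directions of maximal linear variance, so if the musical concept isn't linearly separable in the features, the named axis won't track it.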
> Anna’s archive business is stealing copyrighted content and selling access to it.
There is not enough profit in that compared to the risk. They're also not exactly aggressive about monetizing it (there are groups hosting mirrors that charge far more, or finance it the usual criminal way: getting people to install malware).
To me, there's a "motivation gap" between what they get out of this and the effort it takes, so some kind of ideology must be involved. Whether it's 100% what they say it is, is another question.
I appreciate having an OCR interface rather than having to chat with a bot, but unfortunately chatting with Gemini 3 gives far better results than this. I gave this model the document that Gemini 3 got a surprisingly good result on:
Just out of pity I gave it a birthday card from my sister, written in very readable modern handwriting. While it managed to make the contents readable, the errors it made reveal that it has very little contextual intelligence. Even if ! and ? can sometimes be hard to tell apart, they weren't here, and you do not usually start a birthday letter with "Happy Birthday brother?"
Something I noticed about Gemini: I've been experimenting with transcribing old handwritten Gaelic archives. Qwen 235B A22B Instruct appears to give a much more faithful reproduction than Gemini, for the simple reason that Gemini keeps hallucinating an old Gaelic faerie tale.
I believe you misread. My reading is that Gemini 3 gave a good result on a certain input, so they gave the same input to this model and the result was poor.
I'd like to see a version of the HN frontpage, where the titles are reinterpreted by that 1913 AI. "Imagine these are newspaper headlines from the year 2025. Rewrite them so that a regular person in our time can understand them."
One thing I haven't seen anyone bring up yet in this thread, is that there's a big risk of leakage. If even big image models had CSAM sneak into their training material, how can we trust data from our time hasn't snuck into these historical models?
I've used Google books a lot in the past, and Google's time-filtering feature in searches too. Not to mention Spotify's search features targeting date of production. All had huge temporal mislabeling problems.
That's one of our fears too. What we've done so far is drop docs where the data source was doubtful about the date of publication; if there are multiple possible dates, we take the latest, to be conservative. During training, we validate that the model learns pre-cutoff but not post-cutoff facts. https://github.com/DGoettlich/history-llms/blob/main/ranke-4...
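The filtering rule described above (drop documents with doubtful dates, otherwise take the latest candidate date) can be sketched like this; the function name and signature are illustrative, not taken from the linked repo:

```python
def effective_date(candidate_dates, doubtful=False):
    """Pick a conservative publication date for a document.

    - Return None (caller drops the document) if the data source
      flags the date as doubtful, or no candidate dates exist.
    - With multiple candidates, take the latest, so a document is
      never attributed to a period earlier than it could belong to.
    Illustrative only; the actual pipeline logic lives in the repo.
    """
    if doubtful or not candidate_dates:
        return None
    return max(candidate_dates)

print(effective_date([1890, 1912]))            # 1912
print(effective_date([1890], doubtful=True))   # None
```

Taking the latest date is conservative for a pre-cutoff training corpus: a document can only be excluded too eagerly, never included too early.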
If you have other ideas or think that's not enough, I'd be curious to know! (history-llms@econ.uzh.ch)