One thing I haven't seen anyone bring up in this thread yet is that there's a big risk of leakage. If CSAM managed to sneak into the training material of even the big image models, how can we trust that data from our own time hasn't snuck into these historical models?
I've used Google Books a lot in the past, and Google's time-filtering search feature too, not to mention Spotify's search filters for date of production. All of them had huge temporal mislabeling problems.
That's also one of our fears. What we've done so far is drop docs where the data source was doubtful about the date of publication; if there are multiple possible dates, we take the latest to be conservative. During training, we validate that the model learns pre-cutoff but not post-cutoff facts. https://github.com/DGoettlich/history-llms/blob/main/ranke-4...
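Roughly, the date rule looks like this (a simplified sketch, not our actual pipeline; the function name, field shapes, and cutoff year are made up for illustration):

    from typing import Optional

    CUTOFF_YEAR = 1900  # illustrative cutoff, not the real one

    def conservative_year(candidate_years: list[int]) -> Optional[int]:
        # No usable date from the source -> doubtful, drop the doc.
        if not candidate_years:
            return None
        # Several candidate dates -> assume the latest, to be conservative.
        year = max(candidate_years)
        # Even the conservative date lands past the cutoff -> possible
        # leakage, so drop the doc.
        if year > CUTOFF_YEAR:
            return None
        return year

    # A doc catalogued as "1846 or 1902" gets dropped entirely:
    assert conservative_year([1846, 1902]) is None
    assert conservative_year([1846, 1871]) == 1871
    assert conservative_year([]) is None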
If you have other ideas or think that's not enough, I'd be curious to know! (history-llms@econ.uzh.ch)