I don't understand the thing about the cache. Presumably they have a model that they are training; isn't that effectively their cache? Are they retraining the same model on the same data on the theory that it will weight highly-ranked pages more heavily, or something? Or is this about training slightly different models?

If they are really just training the same model, and there's no benefit to training on the same data multiple times, then presumably they could use a probabilistic data structure to check whether they've already trained on a page, keyed on the Last-Modified header + URI? (Something like https://en.wikipedia.org/wiki/HyperLogLog, though strictly HyperLogLog only estimates cardinality; a Bloom filter is the better fit for membership testing.) That would be far cheaper than a full cache, and cheaper than rescraping.
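
A minimal sketch of that idea in Python, assuming a hand-rolled Bloom filter keyed on URI plus Last-Modified; the class, sizes, and the scrape_and_train() placeholder are all illustrative, not any crawler's actual implementation:

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: probabilistic set membership, no false negatives."""

        def __init__(self, size_bits=1 << 24, num_hashes=7):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, key: str):
            # Derive k bit positions from two 64-bit halves of one SHA-256
            # digest (Kirsch-Mitzenmacher double hashing).
            digest = hashlib.sha256(key.encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            for i in range(self.k):
                yield (h1 + i * h2) % self.size

        def add(self, key: str):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key: str):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    # Hypothetical usage: key each page by URI plus its Last-Modified header,
    # so a page is only refetched when it actually changes.
    seen = BloomFilter()
    key = "https://example.com/page" + "|" + "Tue, 01 Jul 2025 08:00:00 GMT"
    if key not in seen:
        # scrape_and_train(key)  # placeholder for the crawler's real work
        seen.add(key)

At these sizes (2 MB of bits, 7 hashes) the false-positive rate stays tiny for millions of keys, and a false positive here only means skipping a page that was in fact unseen, which is a cheap failure mode for deduplication.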

I was also under the impression that the name of the game with training was to get high-quality, curated training sets, which by their nature are quite static? Why are they all still hammering the web?


