> I would presume Google still has all this data. ...
Maybe - I'd guess they served that "cached" content from DB records that stored it directly (URL X has contents Y, basically a "mirror" of the pages they indexed). Not having to keep that mirror (only the search index) could save quite a lot of storage space, plus the I/O and CPU to decompress it, since users won't be requesting it anymore - all in all that could cut infrastructure costs quite a bit.
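To make the distinction concrete, here is a rough sketch (all names and structures are my own assumptions, not anything Google has described): serving the "cached" view needs a full URL-to-contents mirror, while answering search queries only needs the inverted index.

```python
# Hypothetical sketch: a URL -> compressed-page "mirror" (needed only to serve
# cached pages) vs. an inverted index (enough for search). Not Google's design.
import zlib

def build_stores(pages: dict[str, str]):
    """pages: {url: raw_html}. Returns (mirror, inverted_index)."""
    mirror = {}          # full copy of every page, needed for the "cached" view
    inverted_index = {}  # term -> set of URLs, enough to answer queries
    for url, html in pages.items():
        mirror[url] = zlib.compress(html.encode())  # storage + later decompression cost
        for term in set(html.lower().split()):
            inverted_index.setdefault(term, set()).add(url)
    return mirror, inverted_index

pages = {"https://example.com": "<html>cached page example</html>"}
mirror, index = build_stores(pages)
# Dropping `mirror` removes the per-URL full-text copies (storage) and the
# decompression work done whenever someone opened a cached page (CPU + I/O),
# while `index` alone still supports search.
print(len(mirror["https://example.com"]), sorted(index)[:3])
```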
> Could this be an advantage that Google can use to train their models on, but that others won't have access to?
Maybe (if they decided to just get rid of the I/O tied to those user requests), but on the other hand I don't know whether any Google consumer was ever able to mass-download Google's "cached" data - could that have been done without being banned from Google's webpage (or API)?
As I understand it, Google does a fair amount of rendering of a page before indexing it; this a) lets it index content loaded by JS, and b) prevents some of the ways spammers show Google different content than they show users. Perhaps Google's primary way of storing a page no longer matches something that can easily be served as a cached page. Dropping the cache might be a way to remove a legacy copy of each page and reduce storage costs.
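A minimal sketch of the "render before indexing" idea (Playwright is my stand-in here, not necessarily what Google uses): the rendered DOM can contain JS-loaded text that a plain HTTP fetch never sees, and comparing the two is one way to spot cloaking.

```python
# Fetch raw HTML, then render the same URL in a headless browser and extract
# the visible text. Index the rendered text, not the raw fetch.
import urllib.request
from playwright.sync_api import sync_playwright

URL = "https://example.com"  # placeholder URL

raw_html = urllib.request.urlopen(URL).read().decode("utf-8", "replace")

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")  # let client-side JS finish running
    rendered_text = page.inner_text("body")   # text roughly as a user would see it
    browser.close()

# Anything injected by JS shows up only in `rendered_text`, and content served
# differently to bots stands out when the two versions are compared.
print(len(raw_html), len(rendered_text))
```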
Just as with YouTube, the surface area of these services keeps getting smaller and you get less and less. Too much optimization to the detriment of users. All the while, search is still rooted in '90s concepts and only serves as a money-making thing.
Could this be an advantage that Google can use to train their models on, but that others won't have access to?
Google wants it to be more difficult to notice rewrites? Journalists have too often found valuable information with it?