
I would presume Google still has all this data. They just will not let anyone else use it.

Could this be an advantage that Google can use to train their models on but others won't have access?

Does Google want to make rewrites harder to notice? Have journalists too often found valuable information with it?



I feel like the Internet Archive has taken a lot of that sort of use away from Google.

Unrelated: Google should probably think about a sizable donation to the Internet Archive.


Some kind of collaboration appears to be happening between the two: https://blog.archive.org/2024/09/11/new-feature-alert-access...


They should donate all their saved data from Google Cache too.


> I would presume Google still has all this data. ...

Maybe. I guess they must have served that "cached" content from DB records that stored it directly (URL X has contents Y, basically a "mirror" of what they indexed). Not having to store that "mirror", only the search index, might save quite a lot of storage space, plus the I/O and CPU needed to decompress it, since users won't be requesting it anymore. All in all, that might save quite a lot in infrastructure costs.

> Could this be an advantage that Google can use to train their models on but others won't have access?

Maybe, if they decided just to get rid of the I/O related to user requests. On the other hand, I don't know whether any Google user was ever able to mass-download Google's "cached" data. Could that be done without being banned by Google's website (or API)?


As I understand it, Google does a decent amount of rendering of a page before indexing; this a) allows it to index content loaded by JS and b) prevents some ways spammers show Google different content from users. Perhaps Google's main way of storing a page no longer matches something that can be easily served as a cache page. This might be a way to remove a legacy copy of each page and reduce storage costs.


> prevents some ways spammers show Google different content from users.

Google obviously hasn't cared about that for a long time.


Just as with YouTube, the surface area of these services is getting smaller and smaller, and you get less and less. Too much optimization, to the detriment of users. All the while, search is still rooted in '90s concepts and serves only as a moneymaker.



