Do you think Google is worried about a link exodus if something happens to the Internet Archive (https://news.ycombinator.com/item?id=41447758)? I started maintaining a blog and a corresponding link log in 2019, and I've already seen links disappear, leaving me relying on the Wayback Machine for them. I've found scanned versions of obscure magazines and research papers on the IA that are impossible to find anywhere else on the web. If they do end up shutting down, I am worried that the exodus is going to be real. I'm not sure what the contingency plan is.
From what I've seen of the people involved with IA and the related communities, I feel confident that in the very unlikely case of IA having to shut down, the data would be backed up elsewhere in its entirety.
The Internet Archive is so darn useful, I wish it had captured even more; there are times when I just can't find some niche thing I remember. But it does capture a lot of things.
The Internet Archive has roughly the same problem with the internet as space travel has with space, there's just so unimaginably much of it. You'd think you have some kind of a grasp of how much of it there is but it isn't anywhere close to tangible.
There should be a tool that runs in the background of your web browser and, for every page you visit, captures it and uploads it to some sort of archive. Anonymously, and with some way to prevent accidentally uploading bank details or other confidential information (this would have to work perfectly by default, so the best approach is probably a global whitelist of sites like Reddit, news, etc., curated by trusted people).
Maybe this tool already exists, maybe it’s what ArchiveTeam uses, but more people should use it.
With enough people, you'd have an archive of the websites people actually visit (well, the people who use the tool). With a few users it would only cover the most popular sites and a handful of outliers; with more people and more time it would start reaching niche content.
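The capture-with-whitelist idea above can be sketched in a few lines. The whitelist contents here are hypothetical, and real tooling (ArchiveTeam's, say) works differently, but the Wayback Machine does accept submissions at its public Save Page Now endpoint (`web.archive.org/save/<url>`):

```python
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical curated whitelist; anything not on it (banking, webmail,
# intranets, ...) is never captured or uploaded.
ARCHIVE_WHITELIST = {"reddit.com", "news.ycombinator.com", "bbc.co.uk"}

def should_archive(url: str) -> bool:
    """True only if the page's host is on, or a subdomain of, the whitelist."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ARCHIVE_WHITELIST)

def archive_page(url: str) -> None:
    """Submit a whitelisted page to the Wayback Machine's Save Page Now endpoint."""
    if should_archive(url):
        urlopen("https://web.archive.org/save/" + url)

print(should_archive("https://www.reddit.com/r/DataHoarder/"))  # True
print(should_archive("https://mybank.example/login"))           # False
```

Defaulting to "never upload unless whitelisted" is what makes the perfect-by-default requirement tractable: the failure mode is missing a page, not leaking one.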
Furthermore, sacrifice some anonymity (at minimum you need some identity verification to prevent trivial SEO gaming; you'd probably also want region and other broad characteristics for filtering) and you have a search engine. You can scrape the archived sites for keywords; you can measure a site's popularity by how many people visited it; and you can estimate whether two sites are related by the probability that someone who visits site A also visits site B.
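The popularity and relatedness signals in that last sentence fall straight out of the visit log. A minimal sketch over toy, made-up data:

```python
from collections import defaultdict

# Toy visit log: user -> set of sites visited (hypothetical data).
visits = {
    "u1": {"a.com", "b.com"},
    "u2": {"a.com", "b.com", "c.com"},
    "u3": {"a.com"},
    "u4": {"c.com"},
}

# Popularity: how many distinct users visited each site.
popularity = defaultdict(int)
for sites in visits.values():
    for site in sites:
        popularity[site] += 1

def p_b_given_a(a: str, b: str) -> float:
    """Estimate P(user visits b | user visits a) from the log."""
    visited_a = [sites for sites in visits.values() if a in sites]
    if not visited_a:
        return 0.0
    return sum(1 for sites in visited_a if b in sites) / len(visited_a)

print(popularity["a.com"])            # 3
print(p_b_given_a("a.com", "b.com"))  # 2 of the 3 a.com visitors -> ~0.667
```

This is essentially co-occurrence counting; a real ranking would also need to correct for raw popularity, since everyone visits the biggest sites regardless of interest.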
I use an app on macOS to do this. It's called History Book. There are also browser extensions. And my bookmark manager also submits anything I save to the Wayback Machine.
I don't trust anyone who is merely claimed to be trusted on the internet, though. I'd rather we archive everything, even the bad, as long as it's not illegal. Otherwise you wind up with my initial issue: you don't get everything, and you miss small things that might seem insignificant but hold more value than you'd ever realize.
> "The Internet Archive has roughly the same problem with the internet as space travel has with space, there's just so unimaginably much of it. You'd think you have some kind of a grasp of how much of it there is but it isn't anywhere close to tangible."
And what you describe is actually far truer than people even begin to grasp, because the Internet is so much more than just the "world wide web" that most people instantly think of when they hear the word "Internet".
I don't think so. Most of the web is behind a login and/or unlinkable, so you're left with the "open web". That part is much smaller, so it's not impossible to archive a meaningful chunk of it; it's pretty tangible, especially the useful parts.
The meaningful part of the open web is small, yes. Sadly there are so many junk pages cluttering search results, nowadays partially generated by AI, previously made by just randomly copy-pasting content from other pages. It somehow needs to all be filtered out, otherwise it'll end up taking the place of something more useful... So I really wonder how much of the open web is some kind of original content and how much is duplicated or auto-generated junk.
I'm not convinced. There are various estimates for how large the internet is, with varying confidence, but most I found average around a few hundred zettabytes. The Internet Archive seems to be in the ballpark of a hundred petabytes. So unless I got it wrong, the archive currently covers something like 0.00003% of the whole thing. How much we need to cover the useful bits is a separate discussion, of course.
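The back-of-envelope arithmetic, taking ~300 ZB for the internet and ~100 PB for the archive (both figures rough estimates from the thread):

```python
# SI prefixes: a zettabyte is a million petabytes.
PB = 10**15  # petabyte in bytes
ZB = 10**21  # zettabyte in bytes

archive_size = 100 * PB   # rough Internet Archive size
internet_size = 300 * ZB  # rough total-internet estimate

fraction = archive_size / internet_size
print(f"{fraction:.1e}")         # 3.3e-07
print(f"{fraction * 100:.5f}%")  # 0.00003%
```

Under these assumptions the archive holds roughly one three-millionth of the total, so the useful-bits question really is doing all the work.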
I noticed Google's cache vanished, and I feared the IA would feel even more pressure from the change; this spikes that fear into near terror. When there was a Google cache of everything available online, it was harder to go after the IA, because you would have to face the juggernaut if you wanted to challenge caches in court. It feels like Google is basically saying "good luck" with this change now. I foresee the NYT and their scummy ilk getting aggressive about their precious content farms being jeopardized. Not a good omen from my view.
The Google cache was so useful when it was a one-click alternative to the main link, especially with all the dynamic changes and dead links. The loss of useful features like this, imo, helped reduce my reliance on Google search.
It was also useful for getting the plaintext version of the page that Google's indexer saw, which actually contained the search term, vs. the real page that hides it in some collapsed section invisible to Ctrl-F until you find and click it.
This is low-key so useless. From the headline I expected some deeper kind of integration: indexing, the ability to find old dead pages with useful info more easily.
Nope.
They just throw a generic link in a place hidden so deep that nobody will ever see it: you first need to click the three dots, then find "More about this page" in the middle, and there it is. I don't know why they bothered announcing it at all; it's so hidden and useless they might as well not have. Just pasting the website link into the IA search gets you the same result, faster.
Nobody ever clicks that among all the useless junk around it. It should appear right after clicking the three dots, which already show basic details about the page (it's not a submenu dropdown anymore like it used to be). Nobody's ever going to look below the first paragraph where the main buttons are.
If they wanted to make this at all useful, given how hidden it already is, they should have indexed the particular page's list of archived revisions and displayed direct links to them along with the dates they were taken. Just linking to the IA search in a generic way isn't going to help anyone, especially since the page might turn out not to be archived at all in the first place.
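For what it's worth, the revision list described above is already queryable through the Wayback Machine's public CDX API, so a sketch of "direct links with dates" is short. The helper names here are mine, not any official client:

```python
import json
from datetime import datetime
from urllib.request import urlopen

def snapshot_timestamps(url: str, limit: int = 5) -> list[str]:
    """Ask the Wayback Machine's CDX API for archived revisions of a page."""
    api = ("https://web.archive.org/cdx/search/cdx"
           f"?url={url}&output=json&fl=timestamp&limit={limit}")
    rows = json.load(urlopen(api))
    return [row[0] for row in rows[1:]]  # first row is the field-name header

def snapshot_link(url: str, timestamp: str) -> str:
    """Render one revision as a readable date plus a direct Wayback link."""
    date = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return f"{date:%Y-%m-%d} https://web.archive.org/web/{timestamp}/{url}"

# Usage (hits the network):
#   for ts in snapshot_timestamps("example.com"):
#       print(snapshot_link("example.com", ts))
```

An empty result from the CDX query also answers the "might not be archived at all" case up front, instead of dumping the user on a blank IA search page.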