>competing with Google was borderline impossible a decade ago. But in 2024, we have cheap compute, great OSS distributed DBs, powerful new vector search tech. [...] CommonCrawl text-only is ~100TB,
A modest homegrown tech stack of 2024 can maybe compete with a smaller Google circa ~1998, but that thought experiment handicaps Google by ignoring its current state of the art. Instead, we have to compare OSS-today vs Google-today. There's still a big gap between the 2024 OSS tech stack and Google's internal 2024 tech stack.
E.g., for all the billions Microsoft spent on Bing, there are still queries where I've noticed Google is better. Google found more pages about obscure people I was researching (obituaries, etc.). But Bing had the edge when I was looking up various court cases by docket #. The internet is now so big that even billion-dollar search engines can't get to all of it. Each has blind spots. I have to use both search engines every single day.
I was talking about text-only, filtered and deduped content.
Most of Google's 100PB is pictures and video. Filtering out spam and deduping content helped Google reduce its index from ~50B pages in 2012 to under ~10B today.
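Even a naive dedup pass over text-only content gets you a long way. Here's a minimal sketch in Python (hypothetical normalization, nothing like whatever Google actually runs, which would need SimHash/MinHash-style near-duplicate detection at scale):

    # Naive dedup of crawled pages by hashing normalized text.
    # Collapses pages whose content is identical after trivial cleanup.
    import hashlib
    import re

    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace so formatting differences
        # don't defeat the hash.
        return re.sub(r"\s+", " ", text.lower()).strip()

    def dedup(pages: dict) -> dict:
        seen = set()   # content fingerprints already kept
        kept = {}      # url -> text for the surviving copy
        for url, text in pages.items():
            fingerprint = hashlib.sha256(normalize(text).encode()).hexdigest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                kept[url] = text
        return kept

    pages = {
        "https://example.com/a": "Hello   World",
        "https://mirror.example.com/a": "hello world",   # dupe after normalization
        "https://example.com/b": "something else entirely",
    }
    print(len(dedup(pages)))   # -> 2

Exact-match hashing like this only catches trivial duplicates; presumably most of the real index shrinkage comes from near-duplicate detection on top of it.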
But what if I don't want to search Reddit, Stack Overflow, and blogs from the early 2000s, and the content you just threw away as irrelevant actually contains the information I am looking for? There is an entire working generation that has never heard a modem sound and has never given a thought to making sure their content is accessible in plain text.
I'm sure all the LLM providers are already considering this, but there's so much important information that is locked away in videos and pictures that isn't even obvious from a transcript or description.
There is still a large opportunity. Most of my searches are for plain-text information.
> But what if I don't want to search Reddit, Stack Overflow, and blogs from the early 2000s
That is a strawman. There are huge numbers of websites (including authoritative ones like governments and universities) and a lot of content.
> There is an entire working generation that has never heard a modem sound and has never given a thought to making sure their content is accessible in plain text.
If they want video they will do the same as everyone else and search Youtube. Different niche.
> I'm sure all the LLM providers are already considering this, but there's so much important information that is locked away in videos and pictures that isn't even obvious from a transcript or description.
That is true, but if you are getting bad search results (and the market for other search engines is people who are not happy with Google and Bing results), that does not help much, as you are not seeing the information you want anyway.
> That is a strawman. There are huge numbers of websites (including authoritative ones like governments and universities) and a lot of content.
Ya know... a search engine that was limited to *.gov, *.edu, and country equivalents (*.ac.uk, etc.) would actually be pretty useful. OK, I know you can do something like it with site: modifiers in the search, but if you know from the beginning that you're never going to search the commercial internet, you can bake that assumption into the design of your search engine in interesting ways.
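As a rough sketch of how that assumption could be baked in at the crawl-frontier level (the suffix list and URLs below are just illustrative):

    # Sketch: a crawl frontier that only admits .gov/.edu-style domains,
    # so the "no commercial web" assumption is enforced before anything
    # is fetched or indexed. Suffix list is illustrative, not exhaustive.
    from urllib.parse import urlparse

    ALLOWED_SUFFIXES = (".gov", ".edu", ".mil", ".ac.uk", ".gov.uk", ".edu.au")

    def in_scope(url: str) -> bool:
        host = (urlparse(url).hostname or "").lower()
        # str.endswith() accepts a tuple, so one call covers the allowlist.
        return host.endswith(ALLOWED_SUFFIXES)

    candidates = [
        "https://www.nasa.gov/missions",
        "https://www.ox.ac.uk/admissions",
        "https://www.amazon.com/deals",     # dropped: commercial
    ]
    frontier = [u for u in candidates if in_scope(u)]
    print(frontier)   # only the .gov and .ac.uk URLs survive

With the commercial web out of scope from the start, ranking could also lean harder on things like document structure and citations instead of fighting SEO spam.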
That explains why you were 10 times more likely to find something 15-20 years ago than you are today. They reduced the size by dropping a lot of sites and not crawling as much. We would expect Google to be at 100PB x 100 with the growth of users and content over that time period. Someone made the decision to prioritize a smaller index over a more complete one, and some A/B test was run and turned out well.
Just serving up content from Reddit and HN and a few other websites would be enough to beat Google for most of us. Sprinkle in the top 100 websites and you have a legitimate contender.
There is no open web anymore. Google killed it. There are probably fewer than 100k useful websites in the world now. Which is good for startups, because the problem is entirely tractable.
It's all very regional. Despite its name, the world wide web is aggressively local. Some properties are global, but after a handful it's all country/language/region based.
No matter what type of market analysis I do, I almost invariably find there's something different that, say, the Koreans or the Europeans are using. The Yelp of Japan is Tabelog, the Uber Eats of the UK is Deliveroo, the Facebook of Russia is vk.ru, etc.
That's really the beachhead to capture - figure out what a "web region" is for a number of query use-cases and break in there.
Reddit is a good example of a company that is territorial about its content being indexed or scraped. I can't even access it via most of my VPN provider's servers anymore due to them blocking requests.
I liken Google Search and YouTube to how Blockbusters video rental stores used to operate.
If you went into Blockbusters, there was actually only a small subset of the available videos to rent. Films that had been around for decades were not on the shelves, yet garbage released very recently would be there in abundance. If you had an interest in film and, say, wanted to watch everything by Alfred Hitchcock, there would not be a copy of 'The Birds' there for you.
Or another analogy would be a big toy shop. If you grew up in a small town, the toy shop would not stock every LEGO set. You would expect the big toy shop in the big city to have the whole range, but if you went there, you would just find what the small toy shop had, only piled high, with the full range still not available.
Record shops were the worst for this. The promise of Virgin Megastore and its like was always a bit of a letdown, with the local, independently owned record shop somehow having more product than the massive record shop.
Google is a bit like this with information. YouTube is even worse. I cottoned on to this with some testing on other people's devices. Not having Apple products, I wanted to test on old iPads, MacBooks and phones. For this I needed a little bit of help from neighbours and relatives. I already knew I had a bug to work around, and that there was a tutorial on YouTube I needed in order to do a quick fix so I could test everything else. So this meant I had to open YouTube on different devices owned by very different people, with their logged-in accounts.
I was very surprised to see that we all had recommendations very similar to what I could expect on my own account. I thought the elderly lady downstairs and my sister would have very different recommendations from mine, but they did not. I am sure the adverts would have been different, but I was only there to find a particular tutorial, not to be nosy.
I am sure that Google have all this stuff cached at the 'edge', wherever the local copper meets the fibre optic. It is a model a bit like Blockbusters, but where you can get anything on special request, much like how you can order a book from a library for them to get it out of storage for you.
The logical conclusion of this is Google text search becoming more like the encyclopedias and dictionaries of old, where 90% of what you want can be looked up in a relatively small body of words. I am okay with this, but I still want the special requests. There was merit in old-school AltaVista searches, where you could do what amounts to database queries with logical 'and's, 'or's, and the like.
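For what it's worth, the boolean core of those old queries is easy to express over an inverted index; a toy sketch (hand-made index, nothing to do with AltaVista's actual syntax or internals):

    # Toy boolean retrieval over an inverted index: the AND/OR query
    # logic that old-school engines exposed directly.
    documents = {
        1: "alfred hitchcock directed the birds",
        2: "the birds is a 1963 thriller",
        3: "blockbuster stocked recent releases",
    }

    # Build the inverted index: term -> set of doc ids containing it.
    index = {}
    for doc_id, text in documents.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    def AND(*terms):
        sets = [index.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    def OR(*terms):
        return set.union(*(index.get(t, set()) for t in terms))

    print(AND("hitchcock", "birds"))        # {1}
    print(OR("hitchcock", "blockbuster"))   # {1, 3}

The boolean algebra is the easy part; the point is that this level of user control is exactly the kind of 'special request' that has been lost.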
The web was written in a very unstructured way, with WYSIWYG being the starting point and with almost nobody using content sectioning elements to scope headings to the content they describe. This mess suits Google, as they can gatekeep search: you need them to navigate a 'sea of divs'.
Really, a nation such as France, with a language to preserve, needs to make information a public good, with content structured and information indexed as a government priority. This immediately screams 'big brother', but it does not have to be like that. Google are not there to serve the customer; they only care about profits. They are not the defenders of democracy and free speech.
If a country such as France, or even a country such as Sweden, gets its act together and indexes its content in its language as a public good, it can export that know-how to other language groups. It is ludicrous that we are leaving this up to the free market.
You'd be surprised how long it takes to enshittify a piece of tech as well established as Google. The MBAs may be trying but there are still a lot of dedicated folks deep in the org holding out.
Compared to Google X years ago, for example. Unless I'm mixing threads up, that's what we're talking about anyway: the degradation of Google's search results.
>The article you linked doesn't say anything about 100 petabytes
Excerpt from the article: >[...] Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day. [...]

100 million gigabytes is 100 petabytes.
Those 3 example bullet points of today's improved 2024 computing power you list aren't even enough to process Google's scale from 14 years ago, in 2010, when the search index was already 100+ petabytes: https://googleblog.blogspot.com/2010/06/our-new-search-index...