>competing with Google was borderline impossible a decade ago. But in 2024, we have cheap compute, great OSS distributed DBs, powerful new vector search tech. [...] CommonCrawl text-only is ~100TB,
A modest homegrown tech stack of 2024 can maybe compete with a smaller Google circa ~1998, but that thought experiment handicaps Google by ignoring its current state of the art. Instead, we have to compare OSS-today vs Google-today. There's still a big gap between the 2024 OSS tech stack and Google's internal 2024 tech stack.
E.g., for all the billions Microsoft spent on Bing, there are still queries where I've noticed Google is better. Google found more pages about obscure people I was researching (obituaries, etc.). But Bing had the edge when I was looking up various court cases by docket #. The internet is now so big that even billion-dollar search engines can't get to all of it. Each has blind spots. I have to use both search engines every single day.
I was talking about text-only, filtered and deduped content.
Most of Google's 100PB is pictures and video. Filtering out spam and deduping content helped Google reduce its index from ~50B pages in 2012 to under ~10B today.
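Even a naive dedup pass over text-only content gets you a long way. Here's a minimal sketch in Python (hypothetical normalization, nothing like whatever Google actually runs, which would need SimHash/MinHash-style near-duplicate detection at scale):

    # Naive dedup of crawled pages by hashing normalized text.
    # Collapses pages whose content is identical after trivial cleanup.
    import hashlib
    import re

    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace so formatting differences
        # don't defeat the hash.
        return re.sub(r"\s+", " ", text.lower()).strip()

    def dedup(pages: dict) -> dict:
        seen = set()   # content fingerprints already kept
        kept = {}      # url -> text for the surviving copy
        for url, text in pages.items():
            fingerprint = hashlib.sha256(normalize(text).encode()).hexdigest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                kept[url] = text
        return kept

    pages = {
        "https://example.com/a": "Hello   World",
        "https://mirror.example.com/a": "hello world",   # dupe after normalization
        "https://example.com/b": "something else entirely",
    }
    print(len(dedup(pages)))   # -> 2

Exact-match hashing like this only catches trivial duplicates; presumably most of the real index shrinkage comes from near-duplicate detection on top of it.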
But what if I don't want to search Reddit, Stack Overflow, and blogs from the early 2000s, and the content you just threw away as irrelevant actually contains the information I am looking for? There is an entire working generation that has never heard a modem sound and has never given a thought to making sure their content is accessible in plain text.
I'm sure all the LLM providers are already considering this, but there's so much important information that is locked away in videos and pictures that isn't even obvious from a transcript or description.
There is still a large opportunity. Most of my searches are for plain-text information.
> But what if I don't want to search Reddit, Stack Overflow, and blogs from the early 2000s
That is a strawman. There are huge numbers of websites (including authoritative ones like governments and universities) and a lot of content.
> There is an entire working generation that has never heard a modem sound and has never given a thought to making sure their content is accessible in plain text.
If they want video they will do the same as everyone else and search Youtube. Different niche.
> I'm sure all the LLM providers are already considering this, but there's so much important information that is locked away in videos and pictures that isn't even obvious from a transcript or description.
That is true, but if you are getting bad search results (and the market for other search engines is people who are not happy with Google and Bing results), that does not help much, as you are not seeing the information you want anyway.
> That is a strawman. There are huge numbers of websites (including authoritative ones like governments and universities) and a lot of content.
Ya know... a search engine that was limited to *.gov, *.edu, and country equivalents (*.ac.uk, etc.) would actually be pretty useful. OK, I know you can do something like it with site: modifiers in the search, but if you know from the beginning that you're never going to search the commercial internet, you can bake that assumption into the design of your search engine in interesting ways.
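As a rough sketch of how that assumption could be baked in at the crawl-frontier level (the suffix list and URLs below are just illustrative):

    # Sketch: a crawl frontier that only admits .gov/.edu-style domains,
    # so the "no commercial web" assumption is enforced before anything
    # is fetched or indexed. Suffix list is illustrative, not exhaustive.
    from urllib.parse import urlparse

    ALLOWED_SUFFIXES = (".gov", ".edu", ".mil", ".ac.uk", ".gov.uk", ".edu.au")

    def in_scope(url: str) -> bool:
        host = (urlparse(url).hostname or "").lower()
        # str.endswith() accepts a tuple, so one call covers the allowlist.
        return host.endswith(ALLOWED_SUFFIXES)

    candidates = [
        "https://www.nasa.gov/missions",
        "https://www.ox.ac.uk/admissions",
        "https://www.amazon.com/deals",     # dropped: commercial
    ]
    frontier = [u for u in candidates if in_scope(u)]
    print(frontier)   # only the .gov and .ac.uk URLs survive

With the commercial web out of scope from the start, ranking could also lean harder on things like document structure and citations instead of fighting SEO spam.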
That explains why you were 10 times more likely to find something 15-20 years ago than you are today. They reduced the size by dropping a lot of sites and not crawling as much. We would expect Google to be at 100PB x 100 with the growth of users and content over that time period. Someone made the decision to prioritize a smaller index over a more complete one, and some A/B test was run and turned out well.
Just serving up content from Reddit and HN and a few other websites would be enough to beat Google for most of us. Sprinkle in the top 100 websites and you have a legitimate contender.
There is no open web anymore. Google killed it. There are probably fewer than 100k useful websites in the world now. Which is good for startups, because the problem is entirely tractable.
It's all very regional. Despite its name, the world wide web is aggressively local. Some properties are global, but after a handful it's all country/language/region based.
No matter what type of market analysis I do, I almost invariably find there's something different that, say, the Koreans or the Europeans are using. The Yelp of Japan is Tabelog, the Uber Eats of the UK is Deliveroo, the Facebook of Russia is vk.ru, etc.
That's really the beachhead to capture - figure out what a "web region" is for a number of query use-cases and break in there.
Reddit is a good example of a company that is territorial about its content being indexed or scraped. I can't even access it via most of my VPN provider's servers anymore due to them blocking requests.
I liken Google Search and YouTube to how Blockbusters video rental stores used to operate.
If you went into Blockbusters, there was actually only a small subset of the available videos to rent. Films that had been around for decades were not on the shelves, yet garbage released very recently would be there in abundance. If you had an interest in film and, say, wanted to watch everything by Alfred Hitchcock, there would not be a copy of 'The Birds' there for you.
Or another analogy would be a big toy shop. If you grew up in a small town, the toy shop would not stock every LEGO set. You would expect the big toy shop in the big city to have the whole range, but if you went there, you would just find what the small toy shop had, only piled high, with the full range still not available.
Record shops were the worst for this. The promise of Virgin Megastore and its like was always a bit of a letdown, with the local, independently owned record shop somehow having more product than the massive record shop.
Google is a bit like this with information. YouTube is even worse. I cottoned on to this with some testing on other people's devices. Not having Apple products, I wanted to test on old iPads, MacBooks and phones. For this I needed a little bit of help from neighbours and relatives. I already knew I had a bug to work around, and that there was a tutorial on YouTube I needed in order to do a quick fix so I could test everything else. So this meant I had to open YouTube on different devices owned by very different people, with their logged-in accounts.
I was very surprised to see that we all had recommendations very similar to what I could expect on my own account. I thought the elderly lady downstairs and my sister would have very different recommendations from mine, but they did not. I am sure the adverts would have been different, but I was only there to find a particular tutorial, not to be nosy.
I am sure that Google have all this stuff cached at the 'edge', wherever the local copper meets the fibre optic. It is a model a bit like Blockbusters, but where you can get anything on special request, much like how you can order a book from a library for them to get it out of storage for you.
The logical conclusion of this is Google text search becoming more like the encyclopedias and dictionaries of old, where 90% of what you want can be looked up in a relatively small body of words. I am okay with this, but I still want the special requests. There was merit in old-school AltaVista searches, where you could do what amounts to database queries with logical 'and's, 'or's, and the like.
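For what it's worth, the boolean core of those old queries is easy to express over an inverted index; a toy sketch (hand-made index, nothing to do with AltaVista's actual syntax or internals):

    # Toy boolean retrieval over an inverted index: the AND/OR query
    # logic that old-school engines exposed directly.
    documents = {
        1: "alfred hitchcock directed the birds",
        2: "the birds is a 1963 thriller",
        3: "blockbuster stocked recent releases",
    }

    # Build the inverted index: term -> set of doc ids containing it.
    index = {}
    for doc_id, text in documents.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    def AND(*terms):
        sets = [index.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    def OR(*terms):
        return set.union(*(index.get(t, set()) for t in terms))

    print(AND("hitchcock", "birds"))        # {1}
    print(OR("hitchcock", "blockbuster"))   # {1, 3}

The boolean algebra is the easy part; the point is that this level of user control is exactly the kind of 'special request' that has been lost.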
The web was written in a very unstructured way, with WYSIWYG being the starting point and with almost nobody using content sectioning elements to scope headings to the content they describe. This mess suits Google, as they can gatekeep search: you need them to navigate a 'sea of divs'.
Really, a nation such as France, with a language to preserve, needs to make information a public good, with content structured and information indexed as a government priority. This immediately screams 'big brother', but it does not have to be like that. Google are not there to serve the customer; they only care about profits. They are not the defenders of democracy and free speech.
If a country such as France, or even a country such as Sweden, gets its act together and indexes its content in its language as a public good, it can export that know-how to other language groups. It is ludicrous that we are leaving this up to the free market.
You'd be surprised how long it takes to enshittify a piece of tech as well established as Google. The MBAs may be trying but there are still a lot of dedicated folks deep in the org holding out.
Compared to Google X years ago, for example. Unless I'm mixing threads up, that's what we're talking about anyway: the degradation of Google's search results.
>The article you linked doesn't say anything about 100 petabytes
Excerpt from the article: >[...] Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day. [...]

100 million gigabytes is 100 petabytes.
Those 3 example bullet points of today's improved 2024 computing power you list aren't even enough to process Google's scale from 14 years ago, in 2010, when the search index was already 100+ petabytes: https://googleblog.blogspot.com/2010/06/our-new-search-index...