That is a great question. Well, we first have to ask what the purpose would be of someone going through the trouble of creating such (LLM-generated) spammy SEO content. The answer (for the majority of the web, at least for now) is to monetize it with ads/affiliate links. If that is the case, then the answer is easy, as we already penalize sites with ads/trackers on them for our general web search, and completely boot them out of our own index.
In parallel, we are developing LLM-content detector technology to be more efficient at detecting such content regardless of how it is monetised (and we will offer this as an API once developed).
This is a naive take. SEO schemes are attractive for companies that sell products themselves (e.g. try searching anything related to ETL tools). The content itself is the ad and you won’t find any ad serving scripts or affiliate links in there.
(Source: have created such schemes, although would generally not recommend them to my customers nowadays)
Underestimate the average Kagi user at your own peril. I do not think many would fall prey to an LLM generated content marketing page and end up buying a product from such site. Much likelier scenario is the page gets instantly blocked/reported.
They want to index companies that sell products. I don't see a big problem here if a company that sells a product I'm searching for, who happens to also have low-quality SEO content, shows up in that search.
In fact, I would rather they not get penalized for it, since low-quality SEO content is a good way to show up in certain other search engines (Google), and every business wants to show up in Google, making that content quite common even from reputable businesses making a quality product.
As someone who in a past life spent loads of time doing SEO, I cannot help but find this argument flawed.
So, we shouldn’t penalize low-quality SEO spam because of people’s wants? I do want them to penalize those sites because they are a disservice and, more often than not, crappy, unsecured WordPress sites that drown out those that are not spam.
Thank you Kagi team! A shame how far Google’s results have fallen.
Edit: also, SEO is one of the seedier parts of the software industry. Tons of unaware small businesses conned into these awful, low-quality sites. I literally quit because it was so morally bankrupt.
Problem is that many websites used to hire writers who wrote tangentially related posts to get their main product ranked higher. Like LogRocket and Partition Minitool do.
Combine that with that guy who boasted about his 'SEO heist', I think it's a very valid concern.
I have solved many problems because of a blog post created by a company that wanted to get their product name out there, and I don't think they should be looked at negatively for doing that. Are you upset whenever a company's tech blog lands on HN? Because it is virtually the same thing. If you use Kagi and come across a site that you find is low quality and spammy, then just block it. That's the cool thing about using Kagi.
I've also found that type of developer marketing valuable many times in the past. It's sometimes obvious it's going to end in a pitch for the product, but often it does a good job summarizing the key problems in the space, mentioning or showing other solutions/offerings, and explaining which tradeoffs they made for their own product and how they solved issues.
Even if you don't go with the ad, you can quickly pivot to other named players or get a better understanding of the terminology or jargon to start searching more.
My general impression of the LogRocket site is that they have decent articles on how to do frontend development. At least that's what I remember from the times I've been directed there by a search engine.
And we…want to discourage writing useful web pages, even though articles on understanding TypeScript's type system aren't all that closely related to their main product…? What am I missing?
Another decent one would be linux sysadmin info from Digital Ocean and the likes.
But for every joelonsoftware there are 99999 sites that have all copy/pasted the same tutorial about something basic and try to push some random product or just ads.
SEO pages pushing some product are SEO pages pushing some product. You should ignore them no matter what the source is, so what does it matter if they're LLM generated or hand written?
The problem is that people keep consuming the samey low quality content instead of skipping it (think superhero movies and Netflix series that are all indistinguishable from each other). As long as they're satisfied with that, they'll fall for fake product reviews too.
Maybe you can't determine that with certainty, but there may be statistical tools you can use to estimate the probability that some content came from one of the LLMs we know about, based on their known writing styles?
Someone did something like that to identify HN authors (as in correlating similar writing styles between pseudonyms) a few years back, for example: https://news.ycombinator.com/item?id=33755016
Of course, LLM output can be tweaked to evade these, just like humans can alter their writing style or handwriting to better evade detection. But it's one approach.
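To make the stylometry idea concrete, here is a toy sketch (the feature set is my own illustrative choice, not what any production detector uses): fingerprint a text by its relative frequencies of common function words, which tend to be stable for a given author or model, then compare fingerprints with cosine similarity.

```python
import math
import re
from collections import Counter

# Hypothetical feature set for illustration: function words are used
# somewhat unconsciously, so their frequencies make a crude style signal.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for"]

def style_vector(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = max(len(words), 1)
    counts = Counter(words)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two style vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

A real estimator would use far richer features and compare against per-model baselines to output a probability, but the principle is the same: style leaks information even when content doesn't.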
That's a digital signature, same as sending an email with GPG to prove you sent it. You wouldn't say that because some people use GPG you can somehow detect who wrote every email on earth, it's a push model vs pull. This is why I wrote "any sentence" vs "some sentences".
Watermarking is not at all like a digital signature and a lot like steganography. I only have a surface level understanding of the process, but it works by biasing token selection to encode information into the resulting text in a way that's resistant to later modifications and rephrasing.
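A minimal sketch of that token-biasing idea, along the lines of "green list" watermarking (toy vocabulary and always-green sampling are simplifications; a real scheme adds a soft bias to the logits of a ~50k-token vocabulary):

```python
import hashlib
import random

# Toy stand-in for a tokenizer's vocabulary (assumption for illustration).
VOCAB = [f"w{i}" for i in range(40)]

def green_list(prev_token: str, fraction: float = 0.5) -> set[str]:
    """Partition the vocabulary into a 'green' subset, seeded by a hash
    of the previous token, so a detector can recompute the same split."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    shuffled = sorted(VOCAB)
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])

def generate_watermarked(start: str, length: int, seed: int = 0) -> list[str]:
    """'Generation': at each step, sample only from the green list.
    A real LLM would instead nudge green tokens' logits upward."""
    rng = random.Random(seed)
    tokens = [start]
    for _ in range(length):
        tokens.append(rng.choice(sorted(green_list(tokens[-1]))))
    return tokens

def green_fraction(tokens: list[str]) -> float:
    """Detector: fraction of tokens in their predecessor's green list.
    Unwatermarked text hovers near the list fraction (~0.5 here);
    watermarked text scores much higher."""
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)
```

Because the detector only recomputes hash-seeded partitions, it never needs the model itself, which is what makes this resemble steganography rather than a signature over the final text.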
I have my doubts about the effectiveness of this method and realistically, it won't make any difference because the bad actors will just use an LLM that doesn't snitch on them, so you're technically correct.
The only way to make that steganography robust is to have the encoded message be generated with some secret key that can be verified. Otherwise anyone could manually fake the steganography in human-typed messages with the help of some encoder, and you'd have no way of telling whether it was really typed by an LLM. That line of thinking is why it has to work like a signature, as you said, for "any sentence". I also think these methods only work above a certain character count; short messages are impossible to tell apart.
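One way to add the secret key (a sketch of the general idea, not any specific published scheme): derive the green-list partition from an HMAC over the previous token, so that only the key holder can compute, and therefore verify or forge, the watermark.

```python
import hashlib
import hmac
import random

VOCAB = [f"w{i}" for i in range(40)]  # toy stand-in for a tokenizer vocabulary

def keyed_green_list(key: bytes, prev_token: str, fraction: float = 0.5) -> set[str]:
    """Like an unkeyed green list, but the partition is derived from
    HMAC(key, prev_token): without the key, an attacker can neither
    imitate the watermark nor test text for its presence."""
    digest = hmac.new(key, prev_token.encode(), hashlib.sha256).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    shuffled = sorted(VOCAB)
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])
```

This also illustrates the short-message problem: each token contributes roughly one bit of evidence, so a two-token reply like "Yes." simply can't carry a verifiable watermark.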
If you look here: GitHub.com/HNx1/IdentityLM, you can see that it’s relatively easy to sign LLM output with a private key using an adaptation of the watermarking method.
This application is exactly what I was describing. I'll look it over to see how it scales the encryption strength with token length and how it deals with short messages, which is the only part I'd expect to be very hard. If you print two paragraphs, it's easy to change some tokens with a secret key mask, but if you print "Yes", it's not so easy. Thanks for the great share.