It sucks to say, but maybe the best solution is to pay Cloudflare or another fancy web security/CDN company to get rid of this problem for you?
Which is extremely dumb, but when the alternatives are “10x capacity just for stupid bots” or “hire a guy whose job it is to block LLMs”… maybe that’s cheapest? Yes, it sucks for the open web, but if it’s your livelihood I probably would consider it.
(Either that or make following robots.txt a legal requirement, but that also feels like stifling hobbyists who just want to scrape a page)
> Either that or make following robots.txt a legal requirement [...]
A legal requirement in what jurisdiction, and to be enforced how and by whom?
I guess the only feasible legislation here is something where the victim pursues a case with a regulatory agency or just through the courts directly. But how does the victim even find the culprit when the origin of the crawling is being deliberately obscured, with traffic coming from a botnet running on exploited consumer devices?
It wouldn't have to go that deep. If we made not following robots.txt illegal in certain jurisdictions, and blocked all IP addresses from outside those jurisdictions, then the crawling traffic would presumably have to pass through some entity within them, such as a VPN provider, a legal botnet, or an illegal botnet, and you could pursue legal action against that entity.
The VPNs and legal botnets would be heavily incentivized not to allow this (and presumably already do traffic analysis), and illegal botnets should be shut down anyway (some grace in the law for being unaware it is happening should of course be afforded, but once you are aware, it is your responsibility to prevent your machine from committing crimes).
Illegal botnets aren't new. Are they currently shut down regularly? (I'm actually asking, I don't know)
> If we made not following robots.txt illegal in certain jurisdictions, and blocked all IP addresses not from those jurisdictions
That sounds kinda like the balkanization of the internet. It's not without some cost. I don't mean financially, but in terms of eroding the connectedness that is supposed to be one of the internet's great benefits.
Maybe people need to add deliberate traps to their websites. You could imagine a provider like Cloudflare injecting a randomly generated code phrase into thousands of sites, attributed under a strict license, rendered invisibly so that no human ever sees it, and rotated every few days. Presumably LLMs would learn this phrase and later be able to repeat it, and a sufficiently high hit rate would be proof that they used illegitimately obtained data. Kinda like back in the old days when map makers included fake towns, rivers and so on in their maps so that if others copied them they could tell.
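A minimal sketch of how such a rotating canary phrase might work, assuming a hypothetical provider-side helper (the secret, wordlist, and markup attributes here are all made up for illustration): derive the phrase from a secret plus the current date window so it changes every few days, then embed it in markup that browsers hide but scrapers still ingest.

```python
import hashlib
import hmac
import datetime

# Hypothetical shared secret held by the provider (e.g. a CDN); the same
# phrase can be re-derived later to test a model's output against it.
SECRET = b"provider-side-secret"

# Small illustrative wordlist; a real deployment would use a much larger one
# of invented words so collisions with natural text are vanishingly unlikely.
WORDS = [
    "marlivet", "quandrosk", "velutrine", "oskarpen",
    "drimfallow", "teznorick", "plauvern", "cindrapole",
]

def canary_phrase(window_days: int = 3, words: int = 4) -> str:
    """Derive the current canary phrase from the secret and the date window."""
    # Bucket time into N-day windows so the phrase rotates every few days.
    window = datetime.date.today().toordinal() // window_days
    digest = hmac.new(SECRET, str(window).encode(), hashlib.sha256).digest()
    # Map digest bytes to words deterministically.
    return " ".join(WORDS[b % len(WORDS)] for b in digest[:words])

def hidden_markup(phrase: str) -> str:
    """Wrap the phrase so browsers don't render it but scrapers still see it."""
    return (f'<span style="display:none" aria-hidden="true" '
            f'data-license="no-training">{phrase}</span>')

if __name__ == "__main__":
    print(hidden_markup(canary_phrase()))
    # Later: check whether a model can reproduce phrases that only ever
    # appeared inside this hidden markup; a high hit rate suggests it was
    # trained on scraped copies of the participating sites.
```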