I feel like this is unethical. You built yet another bot scraper. It would only be an ethical tool if it validated that I own the website I am scraping before it starts.
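Ownership validation wouldn't even be hard to bolt on. Something like Search Console's file-based check would do: the tool hands you a token, you host it on the site, and scraping only starts once the fetch round-trips. A minimal sketch in Python (the /.well-known/ path and token scheme are placeholders I made up, not any real tool's API):

    import secrets
    import urllib.request

    def issue_token():
        # Random token the site owner has to publish to prove control.
        return secrets.token_urlsafe(32)

    def verify_ownership(site, token, path="/.well-known/scraper-verify.txt"):
        # Fetch the token file from the target site and compare it.
        # The scraper refuses to start unless this returns True.
        try:
            with urllib.request.urlopen(site.rstrip("/") + path, timeout=10) as resp:
                return resp.read().decode().strip() == token
        except OSError:  # URLError is a subclass of OSError
            return False

    token = issue_token()
    print("Host this at /.well-known/scraper-verify.txt:", token)
    # After the owner uploads the file:
    # assert verify_ownership("https://example.com", token)

A DNS TXT record would work just as well for people who can't serve files.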
This is probably a losing direction - protecting your little island of content in the sea of internet and LLM output. You get more value from exposure. This is the trend of open source, Wikipedia, and open scientific publication. LLMs double down on the same collaborative approach to intelligence.
You can of course decouple from the big discussion and isolate your content behind access restrictions, but the really interesting activity will be outside. Look, for example, at llama.cpp and the other open source AI tools we've gotten recently: so much energy and enthusiasm, so much collaboration. Closed stuff doesn't get that level of energy.
I think IP laws are in for a reckoning; protecting creativity by restricting it is not the best idea in the world. There are better models. Copyright is anachronistic: it was invented in the era of the printing press, when copying first became easy. LLMs remix, they don't simply copy, so even the name is unfitting for the new reality. We need to rename it remixright.
The LLM era doesn't give credit or attribution to its sources. It erases exposure. So there's a disincentive to collaborate with it, because it only takes.
> I think IP laws are in for a reckoning; protecting creativity by restricting it is not the best idea in the world.
We've been having this discussion for over 20 years, since the Napster era, or even 40, going back to the elaborate anti-piracy measures for computer games distributed on tape.
I've reached the conclusion that the stable equilibrium is a "small shadow world": enough IP leakage for piracy and preservation, but on a noncommercial scale. We sit with our Plex boxes and our adblockers, knowing that 90% of the world isn't doing that and is paying for it. Too much control is an IP monopoly stranglehold, where it costs multiple dollars to set a song as your phone ringtone or where briefly heard background music gets your video vaporised off social media. Too _little_ control and eventually there is actually a real economic loss from piracy, and original content does not get made.
AI presents a third threat: unlimited pseudo-creative "slop", which is cheap and adequate to fill people's scrolling time but does not pay humans for its creation and atrophies the creative ecosystem.
> In my naive take, authors get to decide how their work is used, not scrapers.
Inasmuch as they've put it on the public web, they've already made a decision about who gets to see it, and you really can't stop people from doing what they want with it on a personal level.
Whether that's printing it out and putting it on a wall in my house, or using whatever tools I have at my disposal to consume it in any way I please, there's not really anything the author can do about it.
Copyright law says otherwise. As for enforcing the law, you're right that it may be difficult for individual authors to move the needle. But that doesn't mean it's ok for scrapers to violate the law.
As to what constitutes fair use, that's a whole other story: some scraping may be found to be legal while other scraping may not. Benefiting monetarily from legally dubious scraping only makes that scraping look more infringe-y. Of course, nothing is settled law until a court decides.
The French newspapers blatantly lied about how metadata tags work during the EU debates, so I wouldn't trust them on this subject.
That was actually a big enlightening moment for me: as soon as money was involved, the so-called ethics went out the window instantly. From the far-left newspapers to the far-right ones, they all lied on this topic. Only a handful of tech blogs and newspapers told the truth.
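For anyone who missed that debate: as far as I remember, the tags in question were the standard machine-readable opt-out signals (robots.txt and the robots meta tags), and it takes almost no code to honor the robots.txt side of it with just Python's stdlib. example.com and MyScraperBot below are placeholders:

    from urllib import robotparser

    # robots.txt is the machine-readable opt-out any crawler can read.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A well-behaved scraper checks before every fetch.
    if rp.can_fetch("MyScraperBot", "https://example.com/articles/1"):
        print("allowed to fetch")
    else:
        print("opted out; skipping")

The parser only reads the published rules; actually obeying them is up to the scraper, which was the whole point of contention.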