Hacker News

Nearly all browsers, scrapers, etc. use the same user agent these days. Tools such as curl and wget are the only ones that come to mind off the top of my head that don't do that out of the box.
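A minimal sketch of the point above, using Python's standard-library urllib as the stand-in for an "honest" client (the URL and browser UA string are just illustrative):

```python
from urllib.request import Request

# Like curl and wget, urllib announces itself honestly by default
# ("Python-urllib/3.x", filled in when the request is actually opened),
# while browsers have converged on near-identical Mozilla/5.0 strings.
req = Request("https://example.com")
print(req.get_header("User-agent"))  # None -- the default UA is added at open time

# Making any client look like a browser is one header away:
spoofed = Request(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"},
)
print(spoofed.get_header("User-agent"))  # the browser-like string above
```

Which is exactly why the user agent header alone tells you very little about who is on the other end.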



There was a discussion a few days ago about one of the AI companies using a very distinctive user agent string for web crawling, but a more browser-like one for web browsing performed at the behest of the user. And some pertinent points were made there: if the AI bot is acting on an explicit request from a user, it more or less deserves to be treated like any other user agent.


It also would make some sense for various real-time-request-augmented bots to not only use the user agent string of the user's browser, but actually use the user's browser to make the request.


AI scrapers are pretty widely ignoring robots.txt, and plenty lie about their user agents. https://rknight.me/blog/perplexity-ai-is-lying-about-its-use...

I'd fully expect OpenAI to do some checks that their bot isn't getting different responses than a seemingly real request.
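A rough sketch of what such a check could look like: fetch the same URL under both user agents and compare the responses. Everything here is hypothetical (`fetch` stands in for whatever HTTP client is used; the demo fetcher just simulates a site that serves bots a stub page):

```python
import hashlib

def looks_cloaked(fetch, url, bot_ua, browser_ua):
    """Return True if the same URL serves different content to the bot UA."""
    as_bot = fetch(url, user_agent=bot_ua)
    as_browser = fetch(url, user_agent=browser_ua)
    return hashlib.sha256(as_bot).digest() != hashlib.sha256(as_browser).digest()

# Fake fetcher simulating a site that cloaks against the crawler:
def fake_fetch(url, user_agent):
    return b"Sorry, no bots." if "GPTBot" in user_agent else b"Full article text."

print(looks_cloaked(fake_fetch, "https://example.com/post",
                    "GPTBot/1.0", "Mozilla/5.0"))  # True: responses differ
```

In practice this needs fuzzier comparison than a hash (dynamic pages differ between any two fetches), but the principle is the same.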


OpenAI managed to add this after a lot of complaining, but most AI scrapers lie about their user agent and ignore robots.txt. Plus, OpenAI gets to keep all the data from before they added this.
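For reference, blocking by that token works like this with Python's standard-library robots.txt parser. GPTBot is OpenAI's documented crawler token; the robots.txt content here is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out GPTBot but allows everyone else:
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("GPTBot", "https://example.com/post"))       # False
print(rp.can_fetch("Mozilla/5.0", "https://example.com/post"))  # True
```

Of course, the whole point of the thread is that this only constrains crawlers that both identify themselves and choose to honor the answer.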


This is great, but what about the Common Crawl data they've used (or still get?), data that Bing might share with them, and other ways they acquire data past, present, and future? All of a sudden not so nicely labelled as GPTBot, is it?


Yeah, this does work as long as the scraper respects robots.txt.

But don't OpenAI and other companies use third-party datasets? Sure, they do plenty of scraping, but I'd bet for some stuff it's cheaper to buy the dataset and then clean up the data.



