Nearly all browsers, scrapers, etc. use the same user agent these days. Tools such as curl and wget are the only ones that come to mind off the top of my head that don't do that out of the box.
There was a discussion a few days ago about one of the AI companies using a very distinctive user agent string for web crawling, but a more browser-like one for web browsing performed at the behest of the user. Some pertinent points were raised there: if the AI bot is acting on an explicit request from a user, it more or less deserves to be treated like any other user agent.
It would also make some sense for bots that augment responses with real-time requests to not only use the user agent string of the user's browser, but to actually use the user's browser to make the request.
OpenAI managed to add this after a lot of complaining, but most AI scrapers lie about their user agent and ignore robots.txt. Plus, OpenAI gets to keep all the data from before they added this.
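For context, the opt-out in question is robots.txt-based: OpenAI documents GPTBot as the user agent of its crawler, and a site that wants to keep it out adds an entry roughly like this (a minimal example, not a guarantee that any given scraper will honor it):

```
User-agent: GPTBot
Disallow: /
```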
This is great, but what about the Common Crawl data they've used (or still get?), data that Bing might share with them, and other ways they acquire data past, present, and future? Suddenly it's not so nicely labelled as GPTBot, is it?
Yeah, this does work as long as the scraper respects robots.txt.
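For what it's worth, a well-behaved crawler is expected to do roughly the following before fetching a page. This is just a minimal sketch using Python's standard urllib.robotparser; the domain, path, and agent name are placeholders, and nothing stops a dishonest scraper from skipping the check entirely:

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# can_fetch() returns False if robots.txt disallows this agent for this path
if rp.can_fetch("GPTBot", "https://example.com/some/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```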
But don't OpenAI and other companies use third-party datasets? Sure, they do plenty of scraping, but I'd bet for some stuff it's cheaper to buy the dataset and then clean up the data.