Hacker News

Nearly all browsers, scrapers, etc. use the same user agent these days. Tools such as curl and wget are the only ones that come to mind off the top of my head that don't do that out of the box.
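A minimal sketch of the point above, using Python's standard-library urllib as the stand-in for an "honest" client (the URL and browser UA string are just illustrative):

```python
from urllib.request import Request

# Like curl and wget, urllib announces itself honestly by default
# ("Python-urllib/3.x", filled in when the request is actually opened),
# while browsers have converged on near-identical Mozilla/5.0 strings.
req = Request("https://example.com")
print(req.get_header("User-agent"))  # None -- the default UA is added at open time

# Making any client look like a browser is one header away:
spoofed = Request(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"},
)
print(spoofed.get_header("User-agent"))  # the browser-like string above
```

Which is exactly why the user agent header alone tells you very little about who is on the other end.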



There was a discussion a few days ago about one of the AI companies using a very distinctive user agent string for web crawling, but a more browser-like one for web browsing performed at the behest of the user. And some pertinent points were made there: if the AI bot is acting on an explicit request from a user, it more or less deserves to be treated like any other user agent.


It also would make some sense for various real-time-request-augmented bots to not only use the user agent string of the user's browser, but actually use the user's browser to make the request.


AI scrapers are pretty widely ignoring robots.txt, and plenty lie about their user agents. https://rknight.me/blog/perplexity-ai-is-lying-about-its-use...

I'd fully expect OpenAI to do some checks that their bot isn't getting different responses than a seemingly real request.
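A rough sketch of what such a check could look like: fetch the same URL under both user agents and compare the responses. Everything here is hypothetical (`fetch` stands in for whatever HTTP client is used; the demo fetcher just simulates a site that serves bots a stub page):

```python
import hashlib

def looks_cloaked(fetch, url, bot_ua, browser_ua):
    """Return True if the same URL serves different content to the bot UA."""
    as_bot = fetch(url, user_agent=bot_ua)
    as_browser = fetch(url, user_agent=browser_ua)
    return hashlib.sha256(as_bot).digest() != hashlib.sha256(as_browser).digest()

# Fake fetcher simulating a site that cloaks against the crawler:
def fake_fetch(url, user_agent):
    return b"Sorry, no bots." if "GPTBot" in user_agent else b"Full article text."

print(looks_cloaked(fake_fetch, "https://example.com/post",
                    "GPTBot/1.0", "Mozilla/5.0"))  # True: responses differ
```

In practice this needs fuzzier comparison than a hash (dynamic pages differ between any two fetches), but the principle is the same.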


OpenAI managed to add this after a lot of complaining, but most AI scrapers lie about their user agent and ignore robots.txt. Plus, OpenAI gets to keep all the data from before they added this.
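For reference, blocking by that token works like this with Python's standard-library robots.txt parser. GPTBot is OpenAI's documented crawler token; the robots.txt content here is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out GPTBot but allows everyone else:
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("GPTBot", "https://example.com/post"))       # False
print(rp.can_fetch("Mozilla/5.0", "https://example.com/post"))  # True
```

Of course, the whole point of the thread is that this only constrains crawlers that both identify themselves and choose to honor the answer.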


This is great, but what about the Common Crawl data they've used (or still get?), data that Bing might share with them, and other ways they acquire data past, present, and future? All of a sudden not so nicely labelled as GPTBot, is it?


Yeah, this does work as long as the scraper respects robots.txt.

But don't OpenAI and other companies use third-party datasets? Sure, they do plenty of scraping, but I'd bet for some stuff it's cheaper to buy the dataset and then clean up the data.



