
A curious title.

"So you want to scrape like the unethical boys?" I guess doesn't scan so well. Bad boys maybe?

I'm pretty sure Internet Archive, etc don't in fact misrepresent what they are to crawl websites...



> "So you want to scrape like the unethical boys?"

What's considered ethical is a very debated topic.

An assertion that something is simply "unethical" should be seen as the starting point of a discussion, not as a self-evident fact.


If someone tells you to go away via the robots exclusion standard, and puts up bot mitigation to prevent you, blocks your IPs, etc. then clearly you do not have their consent to help yourself to the data.

I find it really hard to see how ignoring this clear lack of consent, and going to great lengths to circumvent measures that were plainly put in place to stop you from doing the very thing you are doing, could be twisted into an ethical action.

It may or may not be technically illegal, but that is not a statement about what is ethical.
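For reference, the robots exclusion standard mentioned above is machine-readable, and checking it costs a well-behaved crawler almost nothing. A minimal sketch using Python's standard-library urllib.robotparser; the robots.txt content and the "MyCrawler" user-agent string are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that tells all crawlers to stay out of /private/
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved crawler checks before fetching each URL
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/public/page"))   # True
```

In practice a crawler would fetch the live robots.txt with `rp.set_url(...)` and `rp.read()` instead of parsing a string, but the consent check is the same call.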


Ok, you’re building a service that scrapes e.g., property rental websites to find entries that are trying to scam naive renters.

The property websites are unable to solve the problem, or don't care, but either way they certainly don't want you scraping their valuable data.

Is it still unethical?


That just makes both of you wrong.


Agreed. It's kind of like when a non-profit organization argues that they are entitled to someone's data because "we're not making a profit off of it." That's ridiculous.

Try asking a startup for free software licenses or seats or whatever as a non-profit. "We're entitled to 40 seats of your SaaS solution because we're a non-profit working to solve world peace." It's definitely within the startup's purview to respond with a no.


Surely the ethics are more complicated than just following robots.txt or not. The intended usage counts, and that isn't captured in robots.txt.


If you have a noble intent, you ask the webmaster for permission to use the data. Surely if they agree with your assessment that your intent is indeed noble, then you'll be given consent.

I run a search engine and an internet crawler. I do this all the time. To date, I've never had a webmaster refuse my crawler access when I've asked nicely.


If your noble intent is to identify members of fascist organizations, then obviously the top online fascist sites will say no when you ask if you may scrape them to build up your list of online fascists.

OK, a less provocative example: you have a new algorithm for identifying inaccessible websites. Your automation is scarily good; by crawling a site it can identify many issues that most sites would have to pay for a full audit to find. But now these sites have a problem: if you can identify their sites as inaccessible, they have to fix those issues under the various accessibility standards that apply in the regions where they operate. If they don't allow you access, though, they can perhaps still argue they are accessible on the strength of an audit they did last year. Either way, they don't want to be forced to spend money on accessibility right now, which it sounds like they might have to if they let you crawl their site.

Version 2 of the above: some years ago I interviewed for a job with a big-time magazine publisher in Denmark and said that one of the things that would make me a good employee is my knowledge of accessibility. Their chief of development said they didn't have anyone with disabilities using their site. So if I ask that guy for permission to crawl their site, why would he say yes? They have no users that would benefit! Stop abusing our bandwidth, bleeding-heart guy.


All of these seem like variations of the-ends-justify-the-means, which generally tends to cut both ways in unanticipated ways.

Bullying websites into accessibility compliance will most likely lead to them following the letter of the standard without giving a second of thought as to whether the content is in fact actually accessible. It's very difficult to get someone on board with your cause if your initial contact is an antagonistic one.


This might work in cases where those with the data are engaged in noble acts, but not every actor is.

I scrape and process websites of actors engaged in fraud. I do this to make the data more presentable to the proper authorities and to help uncover further evidence of their activities.

I suspect that asking for consent would be quickly denied and the data/evidence would quickly become inaccessible.


> If you have a noble intent, you ask the webmaster for permission to use the data.

Is Marginalia opt in, then? Surely "not having a robots.txt" ("you didn't say no!") does not equal consent. And surely you could just ask all the webmasters you are scraping from for permission, since you have noble intent.

My point is that this is just hypocritical; you are placing the moral boundary right below what you are doing, while claiming moral superiority. If you ask others (e.g. anti-search Fediverse), they would think you are immoral too.


You really see no difference between following the robots exclusion standard, doing nothing to conceal your origins and intents, and respecting blocks when they appear; vs concealing your origins and intents, willfully ignoring the robots exclusion standard, and going to great lengths to circumvent IP blocks and other bot mitigation measures?

Both of these are the same?


Why do you think the consent of someone relaying information matters in the slightest when it comes to what people do with that information?


> unethical

Using and transforming information in useful ways is unethical if it results in a profit?

That's what our brains do, too.


No, destroying incentives to produce and share information is unethical (and more importantly, self-defeating).

Brains that consume information don’t destroy that incentive, they produce it.

Intermediating that and capturing all of the value for yourself is the unethical part, just like all forms of rent-seeking.


> destroying incentives

Internet usage and content creation are increasing, not decreasing.

I continue to publish comments, code, and images that presumably get used to train models. My incentive hasn't been destroyed.

> rent-seeking

Supply and demand set the prices.

Subscription services provide value and continue to invest in their product, catalog, and/or service. Property owners handle asset ownership and upkeep problems at scale.

Inefficiencies will be met with competition, and businesses not providing value will be out-competed.

Data under-availability is an inefficiency holding us back from bigger and better things.


Tell that to the most used website in the world, which is basically a scraping-and-sorting machine.


I can commit a code change in 2 seconds that would directly tell the most used website in the world to stop scraping and sorting my data, and they would honor it, and that would be the end of that.

I'm under no illusions that they would or would not honor that in the future, but that's the state today.
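The "code change" being described is presumably a robots.txt rule; a hypothetical two-line example that asks Google's crawler (user agent "Googlebot") to stay away from the whole site:

```
User-agent: Googlebot
Disallow: /
```

Dropping this into the site's robots.txt is the kind of two-second change the comment refers to; honoring it is voluntary on the crawler's part, which is exactly the point of the thread.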



