
I've built crawlers that retrieve billions of web pages every month. We had a whole team working on modifying the crawlers to adapt to website changes, reverse engineering AJAX requests, and solving hard problems like CAPTCHAs. Bottom line: if someone wants to crawl your website, they will.

What you can do, however, is make it hard enough that the vast majority of developers can't do it (e.g. my tech crawled billions of pages, but it took a whole team dedicated to keeping it going). If you have money to spend, Distil Networks and Incapsula have good solutions. They block PhantomJS and Selenium-driven browsers, and they rate limit the bots.

What I found really effective, and what some websites do, is tarpitting bots. That is, slowly increasing the number of seconds it takes to return the HTTP response, so after a certain number of requests to your site it takes 30+ seconds for the bot to get the HTML back. The downside is that your web servers need to accept many more concurrent connections, but the benefit is that you throttle the bots to an acceptable level.

I currently run a website that gets crawled a lot, deadheat.ca. I've written a simple algorithm that tarpits bots. I also throw a CAPTCHA every now and then when I see an IP address hitting too often over a span of a few minutes. The website is not super popular and, in my case, it's pretty simple to differentiate between a human and a bot.
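To illustrate, a minimal tarpit sketch in Flask (the thresholds, the 0.5s step, and the in-memory counter are all invented for illustration; in practice you'd track hits in something like Redis over a sliding time window):

    import time
    from collections import defaultdict

    from flask import Flask, request

    app = Flask(__name__)
    hits = defaultdict(int)  # requests seen per client IP

    TARPIT_AFTER = 100  # start slowing down after this many requests
    MAX_DELAY = 30      # never hold a request for more than 30 seconds

    @app.before_request
    def tarpit():
        ip = request.remote_addr
        hits[ip] += 1
        over = hits[ip] - TARPIT_AFTER
        if over > 0:
            # the delay grows with every request past the threshold, capped at MAX_DELAY
            time.sleep(min(over * 0.5, MAX_DELAY))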

Hope this helps...


Do you feel bad at all about making a business out of crawlers while apparently viewing crawling as bad enough that you want countermeasures against it? Don't you feel a slight bit hypocritical about this?


I don't think I'm being hypocritical. I have no issues if people crawl my site, I even identify who they are and give them access to a private API. I do not generate any income from that website though. I provide a service because I love doing it and I cover all the costs. Bots do increase my cost so I choose to limit their activity. Crawl me, but do so using my rules.

One of the alternatives is charging for my service, but bots are not my users' problem; they are mine.


> I've built crawlers that retrieve billions of web pages every month.

Wow, what were you doing with the data?


Competitive intelligence.

Crawling thousands of websites, mashing up the data to analyze competitiveness between them, and selling it back.

For example, the cost of flights. Different websites offer different prices for the same flight. The technology crawls all the prices, combines the data, then resells it back to the websites. Everyone knows everyone's prices, which keeps competition high and prices lower for consumers.
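The mash-up step, at its simplest, is just grouping prices by flight across sites; a toy sketch with made-up sites, routes, and prices:

    from collections import defaultdict

    scraped = [  # (site, flight, price): made-up sample rows from a crawl
        ("siteA.com", "YOW-YYZ 2016-05-01", 189.0),
        ("siteB.com", "YOW-YYZ 2016-05-01", 174.0),
        ("siteA.com", "YOW-YUL 2016-05-01", 99.0),
    ]

    by_flight = defaultdict(dict)
    for site, flight, price in scraped:
        by_flight[flight][site] = price

    for flight, prices in by_flight.items():
        cheapest = min(prices, key=prices.get)
        print(flight, prices, "cheapest on:", cheapest)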


Travel companies pay their GDS for every search they do. It costs so much that it's the primary cost centre for some of them. You were costing them thousands of dollars a day.


GDS?


Good.


If unwanted scraping can be distinguished from legitimate traffic, wouldn't a sort of honeypot strategy work such that you then provide those requests you've identified as likely to be unwelcome with fake or divergent data?


When websites get a ton of traffic, the concern is that the algorithms that detect bots will not be accurate and will start blocking paying customers. So it's a fine line between blocking bots and blocking paying customers. What these algorithms do instead of outright blocking is throw CAPTCHAs, so if the traffic really is human, the CAPTCHA can be solved. The bigger problem is that there's a good chance humans who are thrown a CAPTCHA will leave and buy somewhere else (because they are lazy, the CAPTCHA is hard, etc...).

Solutions like Cloudflare and Distil have sophisticated algorithms to balance real and fake traffic, but even they are not close to being perfect.
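At its core, that graduated response looks something like the sketch below (the scores and thresholds are invented for illustration, not any vendor's actual logic):

    # bot_score in [0, 1] comes from whatever detector you run
    # (request rate, header fingerprint, mouse events, etc.)
    def respond(bot_score: float) -> str:
        if bot_score > 0.95:
            return "block"    # near-certain bot: refuse outright
        if bot_score > 0.60:
            return "captcha"  # uncertain: let a real human prove themselves
        return "serve"        # probably a paying customer: never challenge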


How do you keep your website functioning over time for legitimate users, though? I'm a sysadmin, not a coder or developer, so the tricks you can do are a little foreign to me. Can you provide examples? Why don't Adidas/Nike/et al. do this to fight the likes of sneaker bots?


Did you respect robots.txt?


It sounds to me like an obvious no, if they have a large team to get around countermeasures.


Considering the effort that went into it, I am pretty sure crawling the robots.txt links was their P1 requirement.


+1 for Incapsula or Cloudflare.

BTW, I'm interested in learning a bit more about your stack; we are on the same route but at a smaller scale.


Very curious about this type of work. Is there is a good way to contact you to discuss this topic?


How do you bypass Google reCAPTCHA?


I can't provide details on any innovations we've done with sites like Google, but in general, if you want to crawl Google you'll want to get "many, many" IP addresses. I've heard of people using services like 2captcha.com, but the best way is to obfuscate who you are.

If you can hit Google 60 times per minute per IP before getting blocked and you need to crawl them 1,000 times per minute, you need 1000 / 60 ≈ 17 IPs in rotation. Randomize headers to look like real people coming from schools, office buildings, etc... Lots of work, but possible.
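The pool sizing is just back-of-the-envelope arithmetic (the rates here are the illustrative ones from above):

    import math

    allowed_per_ip = 60  # requests/minute one IP survives before being blocked
    target_rate = 1000   # requests/minute you want across the whole crawl

    ips_needed = math.ceil(target_rate / allowed_per_ip)
    print(ips_needed)  # 17: round-robin your requests across ~17 IPs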


I do it using rotating proxies, stripping cookies between requests, randomly varying the delay between requests, randomly selecting a valid user-agent string, etc. It's a pain in the butt. And to scrape more than I do, faster than I do, would be pretty freaking expensive in terms of time and money.
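A minimal sketch of that kind of loop (the proxy addresses, user-agent strings, and delay range are all placeholders; bring your own pool):

    import random
    import time

    import requests

    PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]  # placeholders
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)",
    ]

    def fetch(url):
        time.sleep(random.uniform(2, 10))             # randomly vary the delay
        proxy = random.choice(PROXIES)                # rotate proxies
        headers = {"User-Agent": random.choice(USER_AGENTS)}  # vary the UA string
        # a bare requests.get() uses no Session, so no cookies carry over between requests
        return requests.get(url, headers=headers,
                            proxies={"http": proxy, "https": proxy},
                            timeout=30)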

Note that Google is pretty aggressive about captcha-ing "suspicious" activity and/or throttling responses to suspicious requests. You can easily trigger a captcha with your own manual searching. Just search for something, go to page 10, and repeat maybe 5-20 times and you'll see a captcha challenge.

If Google gets more serious about blocking me, then I'll use ML to overcome their ML (which should be doable because they're always worried about keeping Search consumer-friendly).


If you do go the ML route, I recommend TensorFlow + Google Cloud (both for the cost/performance and the irony).


There are services that do this with humans for pennies. (A service I've used charges $2/1000)


Mechanical Turk.


360pi, Ottawa, Ontario, Canada

Company Overview: 360pi helps top retailers compete and win in an era when shoppers expect and demand complete price transparency. Our customer base of "brick & mortar", e-commerce, and multichannel retailers accounts for over US$100 billion in annual retail sales and includes Ace Hardware, Best Buy Canada, build.com, TrueValue, and Guitar Center, among others.

We are looking for senior and junior developers to help us write crawlers and products that will reshape the retail industry. If you are interested, see the link below. You can email any questions to dominic@360pi.com

Apply here: http://360pi.applytojob.com/apply/rJ6rlG5osz/Software-Develo...


I think that's a very good rate of growth for a new service. The scaling challenges alone are not obvious, and slowly taking up market share is the safer thing to do. If I were an investor seeing these growth rates, I'd be really happy as long as it keeps up and only plateaus when the goal is reached.


I think @nunobrito covered most of what I wanted to say.

Without knowing your product it's hard to give you the best advice, but here are things that worked for me in the past:

* Use real data that they can relate to. For example, if your software does case management for non-profits, find out who their customers are and create a "fake" example of how your software would manage a case for that company, from start to finish. This is partly for the 3-minute presentation, but more so for the barrage of questions you'll get asked afterwards once they like what they see. It's always easier to sell something when you solve a pain your customer has right now, with one of their customers!

* Use only the terminology that they use and understand. Similar to the previous point, speak the language your customer speaks. Know your users and the language and acronyms they use on a daily basis. I've seen young startups use scientific terms or executive mumbo jumbo that the audience didn't understand so many times.

* The 3 minutes is only the beginning. Whether you succeed or fail, it's not the end. Before, during, and after, identify the lead of the group that will be using your software and try to identify the decision maker. The person who signs the check is often different from the person evaluating your software. This is a sales cycle, and they are both important for you to convince. Get a meeting with them, figure out what makes them tick, what their budget is, when they will be in a position to evaluate your software, etc.

Hope this helps a little, good luck!


I would politely insist that an email be sent out by the founder about your departure before the end of the day. That way you get a chance to talk to your co-workers and explain the situation in your own words. To be clear, always be polite and never talk disparagingly about the company or co-founder, but the fact that you can't have a proper goodbye is a little strange. I've learned that it's always better to rip off the band-aid as quickly as possible while agreeing on an acceptable and honest message with the employee who is leaving. Better late than never, I guess, so I would ask for this first thing this morning.

If it happens, send an email at the end of the day thanking everyone for the time you had with them, the knowledge you gained, and the fun you had. Give your personal email and phone number in case there are any questions and/or someone wants to keep in touch with you.


I actually drafted my "goodbye email" yesterday. :) And it was nothing but gracious and positive about the company and its potential, along with my personal contact info. I have no intention of disparaging the company or founder or anything - I'm more professional than that! ^_^


If these guys are serious about their business, they are using proxies to obfuscate who they are. There are many services now claiming to have a ton of IP addresses; Luminati, Shadio.io, and Nohodo are just a few examples.


Sorry, I'm having trouble finding that second link: shadio. Can you confirm that's their address? Interested in checking them out


Sorry, my bad... it's shader.io


Awesome, thank you. They do look interesting, but their pricing confuses me. Are they just selling individual proxies? Or does $1.80 get you single exit nodes or threads basically?


As far as I know, shader.io sells individual proxies. Luminati and Nohodo are exit nodes priced by the amount of bandwidth that you use.


If these guys are serious about their business, then they should behave like a good netizen and obey robots.txt directives.


360pi Full-time onsite in Ottawa, Canada. http://360pi.com

360pi is a company organizing the world's product data for retailers, brands, and consumers. World-class dev team, leading-edge challenges in cloud, scaling, data, and AI. We're looking for infrastructure/ops developers, QA, and data gurus.

Apply here: http://360pi.com/careers/


I love and miss RegexBuddy! If only they had a Mac/Linux version :( I've had thoughts about switching back to Windows while developing complex regular expressions.


RegexBuddy can be run on a Mac using Wine.


I've built several businesses that either relied in part on scraping/indexing websites or relied solely on scraping/indexing websites. We never achieved the success of Google, but we did get large enough to be noticed by some sites (Amazon, for example). We did face legal issues, but of a different kind. There were a few bugs early on that made us hit websites too much, and we did receive a couple of cease and desist letters. We fixed our problem and explained the situation to the site owners, and everything was resolved.

The only "fair use" type issue that we encountered was using logos from websites. E.g. Displaying the logos of the websites we indexes on our site. Once again, nothing serious came of it. I believe our marketing department removed the logos and put text instead.

Personally, I wouldn't worry about these issues until it becomes a problem. When it becomes a problem it means you're on to something and you're disruptive enough to get some attention. It's a good problem to have IMO.


360pi, Ottawa Ontario, Canada

Company Overview: 360pi helps top retailers compete and win in an era when shoppers expect and demand complete price transparency. Our customer base of "brick & mortar", e-commerce, and multichannel retailers accounts for over US$100 billion in annual retail sales and includes Ace Hardware, Best Buy Canada, build.com, TrueValue, and Guitar Center, among others.

We are looking for young and talented developers to help us write crawlers and products that will reshape the retail industry. If you are interested, see the links below. You can email any questions at dominic@360pi.com

http://360pi.theresumator.com/apply/


"Young"? Really?


I apologize if I offended anyone and/or said anything improper; it was not my intention. I wrote this quickly during the March "Who's hiring" thread and copy/pasted it today. The role that was linked last month is a very entry-level, straight-out-of-college role. 99% of our applicants for these roles are under 30, so I can only guess it was a subconscious thing on my part to put "young" in the description.

Thanks for holding me accountable!


pretty sure asking for "young" developers is against the EEOC, bud...


EEOC age discrimination provisions specify workers 40 and older (https://en.wikipedia.org/wiki/Age_Discrimination_in_Employme...)

I agree that it's distasteful, but it's not actually illegal in the US (and of course the poster is in CA).


does the EEOC apply to CA?


The EEOC does not, but the Canadian Human Rights Act does, and it forbids discrimination on the basis of age unless it is proven to be a BFOR (bona fide occupational requirement).

