Hacker News new | past | comments | ask | show | jobs | submit login

I do a lot of scraping for my day job. We have a business intelligence team that will build us reports that we need from the data that we have. However I find that this process is so incredibly slow and sometimes we only need to compile the data for a one-off project. I used to use vb.net for this as that's what I started learning programming with. Now I use python/requests/bs4 for all my scraping scripts.

I've started working on a new website that will use data scraped from several vbulletin forums. I've found that even 2 vbulletin forums running the same version may have completely different html to work with. I'm assuming that it's the templates they are using that changes it so much.

I'm setting up the process so that the webscraping happens from different locations than the server were the site is hosted. The scraping scripts upload to the webserver via an api I've built for this. Mostly did this because for now I'm just using a free pythonanywhere account and their firewall would block all of this without a paid account. And then also none of these sites would see the scraping traffic coming from my website, etc...




At Diffbot, we have an automated discussion thread parser, currently in beta testing, that might be exactly what you need. Send me a note at mike@diffbot.com and I'd be glad to hook you up.

(disclosure: I work there)




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: