
Try combining curl, wget, or HTTrack with pup (https://github.com/ericchiang/pup/).


Thanks everyone. HTTrack is awesome, but yes, I mean something smarter. Pup looks cool. I want a tool that turns 200 pages of staff bios into something I can pay someone $15/hr to copy-paste quickly into the new CMS. Boilerpipe does the extraction nicely, but it doesn't do the whole job without wget and some scripting, and it either costs money or is complicated to set up (it's in Apache Tika, I guess).
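For what it's worth, here is a rough sketch of the "some scripting" part, assuming the pages have already been mirrored to disk with wget or HTTrack. It uses only the Python standard library, not Boilerpipe or pup, and it is a much cruder heuristic than Boilerpipe: it just drops the contents of tags that usually hold boilerplate. The SKIP_TAGS list, the function names, and the directory layout are my own assumptions, not anything from those tools.

```python
from html.parser import HTMLParser
from pathlib import Path

# Tags whose contents are treated as boilerplate.
# This list is an assumption; tune it for the site being migrated.
SKIP_TAGS = {"script", "style", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the page's visible text, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def dump_directory(src_dir, dst_dir):
    """Write a .txt file for every .html file found under src_dir.

    Both directory arguments are hypothetical; point src_dir at the mirror.
    Files with the same name in different subfolders overwrite each other.
    """
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for page in Path(src_dir).rglob("*.html"):
        html = page.read_text(encoding="utf-8", errors="replace")
        (out / (page.stem + ".txt")).write_text(extract_text(html), encoding="utf-8")
```

Calling dump_directory("mirror", "bios_txt") would leave one plain-text file per bio page, which is roughly the copy-paste-ready form described above. Pages that keep their navigation in plain div elements would need a site-specific rule on top of this.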

But back on-topic, all I really mean to say is that something like this happens to me like every other week. Productize your scripts.



