
I think he means smarter: given a bunch of CMS pages which are text content (different per page) surrounded by (semi-fixed) boilerplate, extract all the content nicely for re-importation.

It's a bit of a one-off, though.
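For what it's worth, the crudest version of that idea can be sketched in a few lines of Python. This is only an illustration, and it assumes the boilerplate is byte-identical across pages and sits entirely before and after the content; real "semi-fixed" templates would need fuzzier matching. The function names and the sample pages are made up for the example.

```python
def common_prefix_len(pages):
    """Count how many leading lines are identical on every page."""
    n = 0
    for lines in zip(*pages):
        if len(set(lines)) != 1:
            break
        n += 1
    return n


def strip_boilerplate(pages):
    """Drop the leading and trailing lines shared by all pages.

    `pages` is a list of pages, each a list of lines. Needs at least
    two pages to be meaningful, since a single page is entirely
    "common" with itself and would come back empty.
    """
    head = common_prefix_len(pages)
    rest = [p[head:] for p in pages]
    # Reverse each page so the shared footer becomes a shared prefix.
    tail = common_prefix_len([p[::-1] for p in rest])
    return [p[:len(p) - tail] for p in rest]


# Hypothetical sample pages, already split into lines.
pages = [
    ["<header>Acme</header>", "<nav>Home | Staff</nav>",
     "Alice runs finance.",
     "<footer>(c) Acme</footer>"],
    ["<header>Acme</header>", "<nav>Home | Staff</nav>",
     "Bob leads design.", "He joined in 2010.",
     "<footer>(c) Acme</footer>"],
]

for content in strip_boilerplate(pages):
    print(content)
```

Anything that varies inside the template (a highlighted nav item, a per-page title tag) would stop the match early, so treat this as a starting point.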



Try a combination of curl/wget/httrack with pup (https://github.com/ericchiang/pup/).
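If you'd rather stay in one script, roughly the same fetch-then-filter pipeline can be approximated with the Python standard library. To be clear, this is not pup and does not support CSS selectors; it only pulls the text inside the element with a given id. The id "bio" and the sample HTML are assumptions for the sake of the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Elements that never have a closing tag, so they must not affect depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class IdTextExtractor(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_by_id(html, target_id):
    parser = IdTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch(url):
    """Download a page as text (the curl/wget step)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Hypothetical page; on a real site you would call fetch(url) instead.
sample = ('<html><body><div id="nav">Home</div>'
          '<div id="bio"><h1>Alice</h1><p>Runs<br>finance.</p></div>'
          '<div id="footer">(c)</div></body></html>')
print(extract_by_id(sample, "bio"))
```

It assumes reasonably well-formed markup; unclosed non-void tags inside the target will throw the depth count off, which is where a real HTML tool like pup earns its keep.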


Thanks everyone. Httrack is awesome, but yes, I mean smarter. Pup looks cool. I want the result to be something that turns 200 pages of staff bios into something I can pay someone $15/hr to copy-paste quickly into the new CMS. Boilerpipe does it nicely, but doesn't do the whole job without wget and some scripting, plus it costs money or is complicated (it's in Apache Tika, I guess).

But back on-topic, all I really mean to say is that something like this happens to me like every other week. Productize your scripts.



