
That sounds like a very creative idea.

What I do is run a regression test every x minutes. If it fails, we set a flag to save the raw HTML every time we crawl a page. Then we can go back and reprocess those saved pages once the crawler is fixed.
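A minimal sketch of that pattern, assuming a hypothetical known-good page, CSS selector, and archive directory (none of these come from the parent comment):

    import time
    from pathlib import Path

    import requests
    from bs4 import BeautifulSoup

    ARCHIVE_DIR = Path("html_archive")  # hypothetical location for saved pages
    archive_raw_html = False            # flipped on when the regression test fails

    def regression_test(html: str) -> bool:
        """Check the parser still finds what we expect on a known page."""
        soup = BeautifulSoup(html, "html.parser")
        return soup.select_one("#price") is not None  # hypothetical selector

    def crawl(url: str) -> str:
        html = requests.get(url, timeout=30).text
        if archive_raw_html:
            ARCHIVE_DIR.mkdir(exist_ok=True)
            (ARCHIVE_DIR / f"{int(time.time())}.html").write_text(html)
        return html

    while True:  # "every x minutes"
        page = crawl("https://example.com/known-page")  # hypothetical URL
        if not regression_test(page):
            archive_raw_html = True  # keep raw pages until the parser is fixed
        time.sleep(15 * 60)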



I crawl a specific site, somewhere up to 50 unique URLs a day. I store both the unparsed full HTML as one file and the JSON I'm looking for as a separate file. The idea is that if something breaks, instead of taking the hit of making the call again, I already have the data and can just reprocess it. It's come in extremely handy when a site redesign changed the DOM and broke the parser.
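A rough sketch of that layout, assuming a hypothetical parse() step and made-up directory names; the point is that reprocess_all() rebuilds the JSON from the saved HTML without any new requests:

    import json
    import time
    from pathlib import Path

    import requests
    from bs4 import BeautifulSoup

    RAW_DIR = Path("raw")        # unparsed full HTML, one file per fetch
    PARSED_DIR = Path("parsed")  # extracted JSON, same stem as the HTML file

    def parse(html: str) -> dict:
        """Hypothetical extraction -- whatever fields the site actually has."""
        soup = BeautifulSoup(html, "html.parser")
        return {"title": soup.title.string if soup.title else None}

    def fetch_and_store(url: str) -> dict:
        RAW_DIR.mkdir(exist_ok=True)
        PARSED_DIR.mkdir(exist_ok=True)
        stem = str(int(time.time()))
        html = requests.get(url, timeout=30).text
        (RAW_DIR / f"{stem}.html").write_text(html)
        data = parse(html)
        (PARSED_DIR / f"{stem}.json").write_text(json.dumps(data))
        return data

    def reprocess_all() -> None:
        """After fixing the parser: rebuild JSON from saved HTML, no network calls."""
        for raw in RAW_DIR.glob("*.html"):
            data = parse(raw.read_text())
            (PARSED_DIR / f"{raw.stem}.json").write_text(json.dumps(data))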

I do the same at $dayJob, where I parse the results of an internal API. Instead of making a call later that may not return the same data, I store the JSON and just process that. Treating network requests as an expensive operation, even though they really aren't, has pushed me toward some clever designs I wouldn't have come up with otherwise. It's arguably a premature optimization, since I've seen something like a 0.000001% failure rate, but being able to replay that one breakage made debugging an esoteric problem way simpler than it would've been otherwise.
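The same idea applied to an API, sketched with a made-up endpoint and response shape: the raw body is written to disk before anything touches it, so a failure in the processing step can be replayed from the file later.

    import json
    import time
    from pathlib import Path

    import requests

    RESPONSE_DIR = Path("api_responses")  # hypothetical archive of raw responses

    def fetch(endpoint: str) -> Path:
        """Save the raw response body before doing anything with it."""
        RESPONSE_DIR.mkdir(exist_ok=True)
        body = requests.get(endpoint, timeout=30).json()
        path = RESPONSE_DIR / f"{int(time.time())}.json"
        path.write_text(json.dumps(body))
        return path

    def process(path: Path) -> dict:
        """Parsing works off the stored file, so any breakage can be replayed."""
        body = json.loads(path.read_text())
        return {"count": len(body.get("items", []))}  # hypothetical shape

    # normal run: fetch once, then process the stored copy
    # saved = fetch("https://internal.example/api/results")  # hypothetical endpoint
    # result = process(saved)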


Off-topic: I so wish I worked for a company where my work involved scraping, storing, and analyzing data. :(


Now is a good time to work in this field, since data science is hot and companies need web scrapers to provide the data for those models. At least that has been my experience in finance. Try applying!


I have zero experience in data science, though. I'm a pretty solid, experienced programmer and can learn it all, but... I don't know. Maybe I should just try Indeed.

Do you have any recommendations for places and/or interview practices?



