
Yes, but the point I'm trying to make is that "isolation" is the wrong approach for data quality issues. That's like fishing out a turd. You need to quarantine the whole lot until you fix the upstream source.

Automating that process (invalidation followed by filtering) is even worse; it merely masks data quality issues when what you want is the opposite: to grind everything to a halt until you get workable, realistic data.
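To make the contrast concrete, here's a minimal sketch (not from the thread) of the two approaches, assuming a pandas DataFrame batch with hypothetical `amount` and `customer_id` columns: quarantining halts on any invalid row, while automated filtering quietly drops them and lets the pipeline keep running.

    # Minimal sketch contrasting quarantine vs. silent filtering.
    # Column names and the validity rule are illustrative assumptions.
    import pandas as pd

    def is_valid(batch: pd.DataFrame) -> pd.Series:
        # Hypothetical row-level check: non-negative amounts, no missing IDs.
        return batch["amount"].ge(0) & batch["customer_id"].notna()

    def quarantine_batch(batch: pd.DataFrame) -> pd.DataFrame:
        """Halt on any bad row: the whole batch is held until upstream is fixed."""
        bad = batch[~is_valid(batch)]
        if not bad.empty:
            raise ValueError(f"{len(bad)} invalid rows; batch quarantined pending upstream fix")
        return batch

    def filter_batch(batch: pd.DataFrame) -> pd.DataFrame:
        """Silently drop bad rows: the pipeline keeps running, the issue stays hidden."""
        return batch[is_valid(batch)]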

Perhaps I'm not "realistic" enough about real-world data to work in data analytics. The frustration of being expected to ignore bad data was a large part of why I left data analysis for software development around 15 years ago.




For sure, you'd generally want to address the source of the issue rather than band-aid it, but that isn't always possible (or at least not immediately), and you just have to work with what you've got. Most of the time your boss isn't going to let you "grind everything to a halt".

But, more importantly, I think the blog is focused more on the practical "here's how to do x with some code", and less on the theory of data science.



