BlinkDB: Queries with Bounded Errors and Response Times on Very Large Data

benhamner · on Aug 20, 2013

I'm thrilled that AMPLab and CSAIL are building this.

For the vast majority of analytics problems and projects I've worked on, approximate numbers are just as good as exact results. One of the biggest productivity blockers can be queries and analytics that take hours or days to run, instead of seconds to minutes, as these dramatically decrease the number of iterations you can execute and ideas you can test.

We commonly work on sub-sampled versions of datasets to enable interactive queries and analytics - it's really great to see someone formalizing this process and handling the details in a simple and principled manner.

ksikka · on Aug 20, 2013

Very cool idea - glad it's being production-ized.

You might benefit from a different name though. "[word]DB" is starting to become a pattern in people's internal spam filters. And it reminds me of CouchDB, MongoDB, RethinkDB, etc.

Scaevolus · on Aug 19, 2013

This is an excellent solution to long latency tails, which become more and more noticeable at large scales. Here's a blog post discussing Google's experience with it: http://highscalability.com/blog/2012/3/12/google-taming-the-...

I expect it will be especially helpful for businesses analyzing data: they can get useful results from massive datasets without massive hardware expenses.

PaulHoule · on Aug 19, 2013

I've often dealt with "big data" by using sampling and stratified sampling, and it is nice to see they're building something that can automate this process.
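
For anyone who hasn't done the manual version, here is a rough sketch of a stratified-sample estimate of a mean with an approximate 95% error bound. This is not BlinkDB's implementation; the function name `stratified_mean` and its parameters (`key`, `value`, `per_stratum`) are made up for illustration, and the interval assumes the usual normal approximation.

```python
import math
import random
import statistics


def stratified_mean(rows, key, value, per_stratum, seed=0):
    """Estimate the mean of value(row) from a stratified sample.

    Returns (estimate, half_width), where half_width is a rough 95%
    interval half-width under a normal approximation (an assumption).
    """
    rng = random.Random(seed)

    # Group values by stratum so rare groups are not lost by sampling.
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(value(row))

    total = sum(len(values) for values in strata.values())
    estimate = 0.0
    variance = 0.0
    for values in strata.values():
        n = min(per_stratum, len(values))
        sample = rng.sample(values, n)
        weight = len(values) / total  # stratum's share of the population
        estimate += weight * statistics.fmean(sample)
        if n > 1:
            # Finite population correction: zero when the stratum is fully read.
            fpc = 1 - n / len(values)
            variance += weight ** 2 * fpc * statistics.variance(sample) / n
    return estimate, 1.96 * math.sqrt(variance)


# Hypothetical data: one large stratum and one rare stratum.
rows = [("x", 5.0)] * 1000 + [("y", 100.0)] * 10
estimate, half_width = stratified_mean(
    rows, key=lambda r: r[0], value=lambda r: r[1], per_stratum=5
)
print(f"mean ~ {estimate:.3f} +/- {half_width:.3f}")
```

The point of stratifying is that the rare "y" group always contributes to the estimate, whereas a uniform sample of the same size could easily miss it entirely.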