Cost-Based Optimizer in Apache Spark 2.2 (databricks.com)
98 points by dmatrix on Aug 31, 2017 | 5 comments



If you are interested in the code behind this, I wrote an overview last month covering the functionality, with links to the code that backs the improvements they talk about: http://hydronitrogen.com/spark-220-cost-based-optimizer-expl...

There's a fair amount of overlap, but where the Databricks article explains the techniques with charts and high-level explanations, I go over the code instead.


On this topic, I really like the Join Order Benchmark paper: http://www.vldb.org/pvldb/vol9/p204-leis.pdf

It basically shows that most cost-based optimizers are pretty bad at cardinality estimation, and that the errors compound as queries use more joins: a misestimate on each join multiplies through the whole plan.
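To make the compounding concrete, here's a toy back-of-the-envelope sketch in Scala (the per-join error factor is an assumption for illustration, not a number from the paper):

    // Toy illustration: if each join's cardinality estimate is off by a
    // factor q, an n-way join's estimate can be off by roughly q^n.
    val perJoinError = 3.0 // assumed misestimation factor per join (illustrative)
    val joins = 5
    val compounded = math.pow(perJoinError, joins)
    println(f"worst-case compounded error: $compounded%.0fx") // prints 243x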


Still catching up with Postgres, which added multivariate column statistics in version 10 :)

Not that this isn't a great development in itself...


What's cool about these statistics-based approaches is that you mostly don't need fully up-to-date statistics, just decent overall stats, unless you have an insane amount of churn. Meaning you can get the query speedup without insertion overhead: you choose when to take that overhead by running ANALYZE.
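In Spark 2.2 terms, that looks something like the snippet below; the table and column names are made up, but the ANALYZE TABLE syntax and the CBO flags are the ones shipped in 2.2 (both flags default to off):

    import org.apache.spark.sql.SparkSession

    // Opt in to the cost-based optimizer; both flags are off by default in 2.2.
    val spark = SparkSession.builder()
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.sql.cbo.joinReorder.enabled", "true")
      .getOrCreate()

    // Table-level stats: row count and size in bytes.
    spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS")

    // Column-level stats (distinct count, min/max, null count) feed the
    // cardinality estimates; rerun whenever the data has churned enough.
    spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS o_custkey, o_orderdate")

You pay the stats-collection cost on your own schedule instead of on every insert.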

Very neat stuff from the Databricks team!


For queries that aren't latency-sensitive, you can also run dynamic sampling.
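Spark 2.2 doesn't do this out of the box; here's a rough sketch of the idea, scanning a small random sample at query time and scaling the observed count up (estimateRows and all names here are illustrative, not a Spark API):

    import org.apache.spark.sql.DataFrame

    // Dynamic-sampling sketch: estimate a predicate's cardinality by
    // counting matches in a random sample and scaling up. Costs one extra
    // pass over ~1% of the data per query, hence "not latency-sensitive".
    def estimateRows(df: DataFrame, predicate: String, fraction: Double = 0.01): Long = {
      val matches = df.sample(withReplacement = false, fraction).where(predicate).count()
      (matches / fraction).toLong
    }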



