
makmanalp on Aug 31, 2017 | on: Cost-Based Optimizer in Apache Spark 2.2

What's cool about these statistics-based approaches is that you mostly don't even need fully up-to-date statistics, just reasonably accurate ones, unless you have an insane amount of churn. That means you can get the query speedup without any insertion overhead: you choose when to pay the cost of collecting statistics by running ANALYZE. Very neat stuff from the Databricks team!
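To make the point about stale statistics concrete, here is a minimal toy sketch in plain Python. It is not Spark's optimizer or API: the `Table` class, `analyze()`, and `choose_build_side()` are hypothetical names made up for illustration, and "pick the smaller table as the hash-join build side" stands in for a real cost model. The idea is only that inserts do no statistics work, and the plan choice stays the same even though the recorded counts are out of date.

```python
class Table:
    """Toy table whose statistics change only when analyze() is called."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = list(rows)
        self.stats_row_count = None  # unknown until analyze() runs

    def insert(self, row):
        # No statistics bookkeeping here, so inserts pay no extra cost.
        self.rows.append(row)

    def analyze(self):
        # On-demand refresh, the toy analogue of running ANALYZE.
        self.stats_row_count = len(self.rows)


def choose_build_side(left, right):
    """Return the table with the smaller recorded row count.

    Raises ValueError if either table has never been analyzed.
    """
    if left.stats_row_count is None or right.stats_row_count is None:
        raise ValueError("run analyze() on both tables first")
    return left if left.stats_row_count <= right.stats_row_count else right


small = Table("small", range(100))
big = Table("big", range(100_000))
small.analyze()
big.analyze()

# Moderate churn after the last analyze(): the recorded counts are now stale.
for i in range(50):
    small.insert(i)
for i in range(5_000):
    big.insert(i)

# The decision uses the stale counts, yet it still picks the right table.
plan = choose_build_side(small, big)
print(plan.name, small.stats_row_count, len(small.rows))
```

The decision would only flip if the churn were large enough to change which table is actually smaller, which is the "insane amount of churn" case.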

For non-latency-sensitive queries, you can also run dynamic sampling.
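As a rough illustration of what dynamic sampling means here, the sketch below estimates a predicate's selectivity at query time from a random sample instead of relying on stored statistics. It is a generic toy in plain Python, not how any particular engine implements it; `estimate_selectivity` and its parameters are hypothetical names.

```python
import random


def estimate_selectivity(rows, predicate, sample_size, seed=0):
    """Estimate the fraction of rows matching predicate from a random sample."""
    rng = random.Random(seed)  # fixed seed only to keep the sketch reproducible
    sample = rng.sample(rows, min(sample_size, len(rows)))
    if not sample:
        return 0.0
    matches = sum(1 for row in sample if predicate(row))
    return matches / len(sample)


rows = list(range(100_000))
# The true selectivity of this predicate is exactly 0.1.
estimate = estimate_selectivity(rows, lambda x: x % 10 == 0, sample_size=1_000)
print(f"estimated selectivity: {estimate:.3f}")
```

The sampling pass adds work to every query that uses it, which is why it fits queries where a bit of extra planning latency does not matter.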