In case you are wondering what it was, after several clicks I found this:
A modern cloud-native, stream-native, analytics database
Druid is designed for workflows where fast queries and ingest really matter. Druid excels at instant data visibility, ad-hoc queries, operational analytics, and handling high concurrency. Consider Druid as an open source alternative to data warehouses for a variety of use cases.
Apache seems to have 100 databases and data processing projects under its umbrella. Is there a cheat-sheet on what they do (and sometimes how they differ) and how popular they are?
Yeah, only a few are actually used widely and have solid feature support.
It can be divided into a few categories:
1. OLTP, things that require frequent updates
HBase, CouchDB
2. Distributed key-value store
Cassandra
3. OLAP, analytical, batch
Hive, Impala
4. ETL, stream processing
Spark
5. OLAP, analytical, low-latency
Druid
There are a bunch of other auxiliary projects that make deploying those things at scale feasible, such as ZooKeeper, HDFS, and whatnot...
With Kafka's removal of ZooKeeper, I wonder if Druid could piggyback off of that to simplify its architecture. At least adopting some of the Kafka terminology would be a step in the right direction. I think even a hard dependency on Kafka would be better for the adoption rate than their current design.
I don't know much about it other than what I've learned through the struggle of trying to set up a cluster.
But why do so many projects depend on Zookeeper? What does it provide that couldn't be done through an embedded library? It seems like a lot of databases don't really need it. Is it worth the extra network dependency and operational complexity?
Your question could be rephrased: why do so many projects depend on an external store for distributed consensus?
One answer is that coupling the consensus part of the system with parts that do active work results in harmful resource conflicts. Those resource conflicts can cause consensus algorithms to fail or take much longer to return answers. Example: Java VM clocks misbehave if the process can't get enough CPU. This can cause systems like ZK to lose quorum.
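To make the heartbeat/CPU-starvation point concrete, here's a minimal sketch using the ZooKeeper Java client (the connect string, timeout, and class name are illustrative, not from any particular project). The client keeps its session alive with periodic heartbeats; if the process is too starved (or paused by GC) to heartbeat within the session timeout, ZK expires the session and any ephemeral coordination state (locks, leadership markers) disappears:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionDemo {
        public static void main(String[] args) throws Exception {
            // 15s session timeout: the client must heartbeat within this window
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getState() == Watcher.Event.KeeperState.Expired) {
                        // the process stalled past the timeout; all ephemeral nodes
                        // are gone and coordination state must be rebuilt from scratch
                        System.err.println("ZK session expired, reconnect and re-register");
                    }
                }
            });
            // ... use zk.create(...), zk.getData(...), etc. for coordination ...
        }
    }

Running the consensus quorum as its own lightly loaded service keeps those timeouts from being blown by whatever heavy ingest or query work the database nodes are doing.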
> But why do so many projects depend on Zookeeper?
Hadoop ecosystem legacy. Most companies adopting tech like Druid were already running Hadoop and had Zookeeper as a result. Probably made sense to take advantage of a reliable, or at least well-known, system.
A lot of tools (especially in Hadoop) were doing the same things, so the idea was to share all that logic in its own dedicated service. Like so many other things in software, the tradeoffs had unforeseen consequences: what was an optimization for that community became baggage for newcomers. I think HashiCorp is one of the worst offenders in this respect (even though I love them and use several of their tools).
Having administered Druid for several years, I find ClickHouse's supposed simplicity definitely appealing if I were starting a new project with similar requirements. Then again, back then I needed petabyte scale and >1 million inserts/sec, and ClickHouse couldn't do it.
AFAIK, there is a benchmark comparing the latest versions of both. Druid definitely has a lot more production deployments as of today, though.