Apache Druid 0.18 (imply.io)
40 points by wochiquan on May 21, 2020 | 29 comments



In case you're wondering what it is, after several clicks I found this:

A modern cloud-native, stream-native, analytics database

Druid is designed for workflows where fast queries and ingest really matter. Druid excels at instant data visibility, ad-hoc queries, operational analytics, and handling high concurrency. Consider Druid as an open source alternative to data warehouses for a variety of use cases.



Apache seems to have 100 databases and data processing projects under its umbrella. Is there a cheat-sheet on what they do (and sometimes how they differ) and how popular they are?


Yeah, only a few are actually widely used with enough feature support.

It can be divided into a few categories:

1. OLTP, things that require frequent updates: HBase, CouchDB

2. Distributed key-value store: Cassandra

3. OLAP, analytical, batch: Hive, Impala

4. ETL, stream processing: Spark

5. OLAP, analytical, low-latency: Druid

There are a bunch of other auxiliary projects that make deploying those things at scale feasible, such as ZooKeeper, HDFS, and whatnot...



This seems to be the right product for loads of systems. I haven't seen much uptake, though.


It has significant operational load and complexity: five node types, all communicating with ZooKeeper.


With Kafka's removal of ZooKeeper, I wonder if Druid could piggyback off of that to simplify its architecture. At least using some of the Kafka terminology would be a step in the right direction. I think even a hard dependency on Kafka would be better for the adoption rate than their current design.


I don't know much about it other than what I've learned through the struggle of trying to set up a cluster.

But why do so many projects depend on Zookeeper? What does it provide that couldn't be done through an embedded library? Seems like a lot of databases don't really need it. Is it worth the extra network dependency and operational complexity?


Your question could be rephrased: why do so many projects depend on an external store for distributed consensus?

One answer is that coupling the consensus part of the system with parts that do active work results in harmful resource conflicts. Those resource conflicts can cause consensus algorithms to fail or take much longer to return answers. Example: Java VM clocks misbehave if the process can't get enough CPU. This can cause systems like ZK to lose quorum.
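
To make it concrete, here's a rough sketch (in Java, with made-up ensemble addresses and znode paths) of the kind of primitive projects lean on ZooKeeper for: ephemeral znodes plus watches give you liveness tracking and leader election, with the consensus protocol running in a separate ensemble instead of inside your own, possibly CPU-starved, JVM. An embedded library would have to replicate all of this session machinery in-process:

  // Rough leader-election sketch with the stock ZooKeeper Java client.
  // Ensemble addresses and the znode path are made up for illustration.
  import org.apache.zookeeper.*;

  public class LeaderSketch {
    public static void main(String[] args) throws Exception {
      // Consensus runs in the separate ZK ensemble, so a GC pause or
      // CPU starvation in *this* process can't stall the quorum itself.
      ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});

      try {
        // EPHEMERAL: the znode disappears when our session dies, so a
        // crashed leader releases leadership automatically.
        zk.create("/myapp/leader", "node-1".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("became leader");
      } catch (KeeperException.NodeExistsException e) {
        // Lost the race; watch the znode to hear when the leader goes away.
        zk.exists("/myapp/leader",
            event -> System.out.println("leader changed: " + event));
      }
    }
  }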


> But why do so many projects depend on Zookeeper?

Hadoop ecosystem legacy. Most companies adopting tech like Druid were already running Hadoop and had Zookeeper as a result. Probably made sense to take advantage of a reliable, or at least well-known, system.


Have an upvote.


A lot of tools (especially in Hadoop) were doing the same things, so the idea was to share all that logic in its own dedicated service. Like so many other things in software, the tradeoff had unforeseen consequences: what was an optimization for that community became baggage for newcomers. I think HashiCorp is one of the worst offenders in this regard (even though I love them and use several of their tools).


Maybe; the community is thinking about it. But it's probably not the most pressing problem to solve right now.


Perhaps you've heard of a little video site called Netflix:

https://netflixtechblog.com/how-netflix-uses-druid-for-real-...


Netflix, Twitter, BT, Walmart I believe...



Product manager at Imply for Druid. AMA...


Your example join output at https://imply.io/post/introducing-apache-druid-0-18-0 has incorrect answers for the join.


I thought Druid lost the battle against ClickHouse long ago. Am I wrong?


Incorrect. They both have their strengths and weaknesses:

https://medium.com/@leventov/comparison-of-the-open-source-o...


> ClickHouse is simpler and has less moving parts and services.

sounds good to me


Having administered Druid for several years, I find ClickHouse's supposed simplicity definitely appealing were I to start a new project with similar requirements. Then again, back then I needed petabyte scale and over 1 million inserts/sec, and ClickHouse couldn't do it.


I managed to do 5 million inserts/sec on a single server. Something must be off.


Depends on how big your structs are coming in, how you are holding on to the data, and what else you are doing with it on ingestion.
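
For what it's worth, batch size alone usually explains order-of-magnitude differences in these numbers, since ClickHouse writes a new data part per insert. A hedged sketch of the idea over JDBC (table name, columns, and row count are invented; assumes the clickhouse-jdbc driver is on the classpath): buffer rows client-side and ship them as one large batch rather than row-at-a-time inserts.

  // Hedged sketch: batched inserts over JDBC. Table name, columns, and
  // row count are made up; assumes the clickhouse-jdbc driver is on the
  // classpath.
  import java.sql.*;

  public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
      try (Connection conn = DriverManager.getConnection(
               "jdbc:clickhouse://localhost:8123/default");
           PreparedStatement ps = conn.prepareStatement(
               "INSERT INTO events (ts, value) VALUES (?, ?)")) {
        for (int i = 0; i < 500_000; i++) {
          ps.setLong(1, System.currentTimeMillis());
          ps.setDouble(2, Math.random());
          ps.addBatch();          // buffer client-side
        }
        ps.executeBatch();        // one large insert -> one ClickHouse part
      }
    }
  }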


I've been using ClickHouse in production to do things I just couldn't do with any other technology; it's not only simpler.


AFAIK, there is a comparable benchmark done on the latest versions of both. Druid definitely has a lot more production deployments as of today, though.


You're right. ClickHouse is the kind of viral grass-roots tech that eventually appears spontaneously in every enterprise.


And that happens because it's very easy to work with and delivers business value on day 1.



