In case you are wondering what it was, after several clicks I found this:
A modern cloud-native, stream-native, analytics database
Druid is designed for workflows where fast queries and ingest really matter. Druid excels at instant data visibility, ad-hoc queries, operational analytics, and handling high concurrency. Consider Druid as an open source alternative to data warehouses for a variety of use cases.
Apache seems to have 100 databases and data processing projects under its umbrella. Is there a cheat-sheet on what they do (and sometimes how they differ) and how popular they are?
Yeah, only a few are actually used widely and have solid feature support.
It can be divided into a few categories:
1. OLTP, things that require frequent updates
HBase, CouchDB
2. Distributed key-value store
Cassandra
3. OLAP, analytical, batch
Hive, Impala
4. ETL, stream processing
Spark
5. OLAP, analytical, low-latency
Druid
There are a bunch of other auxiliary projects that make deploying those things at scale feasible, such as ZooKeeper, HDFS, and whatnot...
With Kafka's removal of ZooKeeper, I wonder if Druid could piggyback off of that to simplify its architecture. At least adopting some of the Kafka terminology would be a step in the right direction. I think even a hard dependency on Kafka would be better for the adoption rate than their current design.
I don't know much about it other than what I've learned through the struggle of trying to set up a cluster.
But why do so many projects depend on Zookeeper? What does it provide that couldn't be done through an embedded library? It seems like a lot of databases don't really need it. Is it worth the extra network dependency and operational complexity?
Your question could be rephrased: why do so many projects depend on an external store for distributed consensus?
One answer is that coupling the consensus part of the system with parts that do active work results in harmful resource conflicts. Those resource conflicts can cause consensus algorithms to fail or take much longer to return answers. Example: Java VM clocks misbehave if the process can't get enough CPU. This can cause systems like ZK to lose quorum.
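To make the heartbeat/CPU-starvation point concrete, here's a minimal sketch using the ZooKeeper Java client (the connect string, timeout, and class name are illustrative, not from any particular project). The client keeps its session alive with periodic heartbeats; if the process is too starved (or paused by GC) to heartbeat within the session timeout, ZK expires the session and any ephemeral coordination state (locks, leadership markers) disappears:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionDemo {
        public static void main(String[] args) throws Exception {
            // 15s session timeout: the client must heartbeat within this window
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getState() == Watcher.Event.KeeperState.Expired) {
                        // the process stalled past the timeout; all ephemeral nodes
                        // are gone and coordination state must be rebuilt from scratch
                        System.err.println("ZK session expired, reconnect and re-register");
                    }
                }
            });
            // ... use zk.create(...), zk.getData(...), etc. for coordination ...
        }
    }

Running the consensus quorum as its own lightly loaded service keeps those timeouts from being blown by whatever heavy ingest or query work the database nodes are doing.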
> But why do so many projects depend on Zookeeper?
Hadoop ecosystem legacy. Most companies adopting tech like Druid were already running Hadoop and had Zookeeper as a result. Probably made sense to take advantage of a reliable, or at least well-known, system.
A lot of tools (especially in Hadoop) were doing the same things, so the idea was to share all that logic in its own dedicated service. Like so many other things in software, the tradeoffs had unforeseen consequences: what was an optimization for that community became baggage for newcomers. I think HashiCorp is one of the worst offenders in this respect (even though I love them and use several of their tools).
Having administered Druid for several years, I find ClickHouse's supposed simplicity definitely appealing if I were starting a new project with similar requirements. Then again, back then I needed petabyte scale and >1 million inserts/sec, and ClickHouse couldn't do it.
AFAIK, there is a benchmark comparing the latest versions of both. Druid definitely has a lot more production deployments as of today, though.