This blog post seems to blame GC heavily, but if you look back at their earlier ...

Aeolun · on Sept 29, 2024

But then it’s still nice that they’re using ScyllaDB and now it’s not a concern at all right?

Even if they were using their original solution wrong, I think the solution that cannot use wrong is superior.

ericvolp12 · on Sept 29, 2024

The funny part is ScyllaDB still uses tombstones for deletions, though they do have configurable compaction strategies and iirc Discord uses Scylla's Incremental Compaction Strategy that I suppose solves the specific issue they were dealing with. iirc that compaction strategy will trigger a compaction once a certain threshold of a partition is tombstones and then the table is rebuilt without the tombstoned content (which effectively pauses writes on that specific node and that specific table and partition for the duration of that process). Compacting a massive partition is really expensive. Scylla defaults to warning you that a partition is too large if it has at least 100,000 rows in it. My guess is when they moved to ScyllaDB they also adopted a new strategy for partitioning messages in a channel that keeps partition sizes reasonable so compactions don't take a super long time.

jhgg · on Sept 29, 2024

We did not change schema or partitioning strategy.

sroussey · on Sept 29, 2024

Good default configurations can mean quite a lot if people don’t tune them.

roenxi · on Sept 29, 2024

I don't see anything here that looks untoward. They increased their data storage by 3 orders of magnitude and decided to use a different DB system. Fair enough, maybe they've learned more about the nature of their data.

But that logic isn't sound. When dealing with huge amounts of data there are going to be trade-offs. Picking a system that makes different trade-offs to an existing system is not automatically helpful. Yes you don't have the old problems. However, you are about to discover new problems. There is always something of a gamble around which will be more of a problem to your business.

frr149 · on Sept 30, 2024

What's the problem with Scylla? Honest question, BTW

vips7L · on Sept 29, 2024

> having just switched over from CMS (!)

This is really interesting. CMS was removed in Java 14 after being replaced by G1GC in Java 9. They were probably running an antiquated Java 8 or 11 runtime. So that means that in 2022 they were either running a 4 year old Java 11 runtime or an 8 year old Java 8 runtime. They were really leaving a lot of performance on the table.

gorset · on Sept 29, 2024

They could also have gone the commercial route and gotten Zing with their pauseless GC. It’s been around forever and they even cover Cassandra in their marketing.

https://www.azul.com/technologies/cassandra/

pebal · on Oct 7, 2024

This is not pauseless GC.