Regardless of whether Redis is *meant* to hold only ephemeral in-memory data, anyone who has had to maintain a Redis instance as part of a deployment knows that ephemeral in-memory work is the only place it is really effective.
I understand your defensiveness, but consider: given your massive experience with Redis, either you have never lost data with it, which would mean nobody besides you understands how to run your software in production, or you have lost data with it, which makes it somewhat hypocritical to insist that Redis is durable.
Your comment seems to suggest that everyone who's ever used Redis except antirez has lost data. I don't think that's true. I've used Redis since 2012 across dozens of production services and never lost data, and I don't have any particular skill that makes me special.
Note that Redis only provides best-effort consistency: it can sometimes lose acknowledged data in special conditions (during failovers, or during restarts with a relaxed fsync policy). So it will never pass Jepsen tests in the default setup; it may pass them only when Redis is used as the state machine of a consensus protocol, with a linearizable algorithm on top of it. But this does not mean that people haven't applied Redis to data storage with success. For instance, many *SQL failover solutions also can't pass Jepsen tests, yet people use them to store real-world data. There are a lot of applications where, in order to scale well at low cost (sometimes spinning up 1/10 or 1/100 of the nodes), it is perfectly viable to pick a system designed for that, accepting as the price that it will lose a window of acknowledged writes during failures, while trying hard to avoid it in the common failure scenarios.
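To make that fsync trade-off concrete, here is a rough sketch (not from this thread) of tightening those knobs from a client, assuming a local Redis server and the redis-py library; the key name is made up for illustration. The relaxed "everysec" default can drop up to a second of acknowledged writes on a crash, "always" fsyncs each write at a latency cost, and WAIT only narrows the failover window rather than closing it.

    # Rough sketch: tightening Redis durability knobs from a client.
    # Assumes a local Redis server and the redis-py package; "orders:1001"
    # is an illustrative key, not something from the thread.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Enable the append-only file and fsync on every write instead of the
    # relaxed once-per-second default (the policy that can lose acknowledged
    # writes on a crash/restart), at the cost of write latency.
    r.config_set("appendonly", "yes")
    r.config_set("appendfsync", "always")

    r.set("orders:1001", "paid")

    # WAIT blocks until the write has reached N replicas (here 1, with a
    # 100 ms timeout). It narrows, but does not eliminate, the window of
    # acknowledged writes that a failover can drop.
    acked = r.execute_command("WAIT", 1, 100)
    print(f"write replicated to {acked} replica(s)")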
Just going from the last one I read for DGraph, it did extremely well. Pretty sure etcd did well.
They always have bugs somewhere, but there is a huge difference between bugs that show up only in very specific, niche cases and the everyday "I wrote to the db and it dropped it".
"We found five safety issues in version 1.1.1—some known to Dgraph already—including reads observing transient null values, logical state corruption, and the loss of large windows of acknowledged inserts."
Loss of large windows of acknowledged inserts. Durability is hard.
As staticassertion mentions, the violations that were found were only around tablet moves, which happen only at certain cluster sizes and quite infrequently. Of course, Jepsen triggers those moves left, right, and center to provoke those failure conditions; but that's not how tablet moves behave under real-world conditions. This is different from other edge cases like process crashes, machine failures, network partitions, clock skew, etc., which can and do happen. In those cases, Jepsen didn't find any violations.
We were planning to look into those tablet move issues and get them fixed up (shouldn't be that hard), but honestly, the chances of our users encountering them are so low that we de-prioritized that work in favor of some of the other launches we are doing.
But, we'll fix those up in the next few months, once we have more bandwidth.
I don't really feel like playing the quotes game... but, sure.
"All of the issues we found had to do with tablet migrations"
"ndeed, the work Dgraph has undertaken in the last 18 months has dramatically improved safety. In 1.0.2, Jepsen tests routinely observed safety issues even in healthy clusters. In 1.1.1, tests with healthy clusters, clock skew, process kills, and network partitions all passed. Only tablet moves appeared susceptible to safety problems."
No one here is claiming that anything gets through that kind of rigorous testing with no bugs found. But there is a huge difference between "my extremely common write path + a partition = dropped transactional writes" and "under very specific circumstances, with worst-case testing, multiple partitions, and the db in a specific state, we drop writes".
There is an ocean between, say, MongoDB's test results and Dgraph's.
"If you use Redis as a queue, it can drop enqueued items. However, it can also re-enqueue items which were removed. "
"f you use Redis as a database, be prepared for clients to disagree about the state of the system. Batch operations will still be atomic (I think), but you’ll have no inter-write linearizability, which almost all applications implicitly rely on."
"Because Redis does not have a consensus protocol for writes, it can’t be CP. Because it relies on quorums to promote secondaries, it can’t be AP. What it can be is fast, and that’s an excellent property for a weakly consistent best-effort service, like a cache."
Again, Redis is a very different type of database, so expectations should be aligned. Further, this test is quite old.
But that's a huge difference from DGraph's results.
Basically, saying "Well no one does well on Jepsen" isn't really true. Lots of databases do well, but you have to adjust your definition of "do well".