Regardless of whether Redis is *meant* to hold only ephemeral in-memory data, anyone who has had to maintain a Redis instance as part of a deployment knows that ephemeral in-memory work is the only place it is really effective.
I understand your defensiveness, but consider: given your massive experience with Redis, either you have never lost data with it, which would mean nobody besides you understands how to run your software in production, or you have lost data with it, which makes it somewhat hypocritical to insist that Redis is durable.
Your comment seems to suggest that everyone who's ever used Redis except antirez has lost data. I don't think that's true. I've used Redis since 2012 across dozens of production services and never lost data, and I don't have any particular skill that makes me special.
Note that Redis only provides best-effort consistency: it can sometimes lose acknowledged data in special conditions (during failovers, or during restarts with a relaxed fsync policy). So it will never pass Jepsen tests in the default setup; it may pass them only when Redis is used as the state machine of a consensus protocol, with a linearizable algorithm on top of it. But this does not mean that people haven't applied Redis to data storage with success. For instance, many *SQL failover solutions also can't pass Jepsen tests, yet people use them to store real-world data. There are a lot of applications where, in order to scale well at low cost (sometimes spinning up 1/10 or 1/100 of the nodes), it is perfectly viable to pick a system designed for that, accepting as the price that it will lose a window of acknowledged writes during failures, while trying hard to avoid it in the common failure scenarios.
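To make that fsync trade-off concrete, here is a rough sketch (not from this thread) of tightening those knobs from a client, assuming a local Redis server and the redis-py library; the key name is made up for illustration. The relaxed "everysec" default can drop up to a second of acknowledged writes on a crash, "always" fsyncs each write at a latency cost, and WAIT only narrows the failover window rather than closing it.

    # Rough sketch: tightening Redis durability knobs from a client.
    # Assumes a local Redis server and the redis-py package; "orders:1001"
    # is an illustrative key, not something from the thread.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Enable the append-only file and fsync on every write instead of the
    # relaxed once-per-second default (the policy that can lose acknowledged
    # writes on a crash/restart), at the cost of write latency.
    r.config_set("appendonly", "yes")
    r.config_set("appendfsync", "always")

    r.set("orders:1001", "paid")

    # WAIT blocks until the write has reached N replicas (here 1, with a
    # 100 ms timeout). It narrows, but does not eliminate, the window of
    # acknowledged writes that a failover can drop.
    acked = r.execute_command("WAIT", 1, 100)
    print(f"write replicated to {acked} replica(s)")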
Just going from the last one I read for DGraph, it did extremely well. Pretty sure etcd did well.
They always have bugs somewhere, but there is a huge difference between bugs that show up only in very specific, niche cases and the everyday "I wrote to the db and it dropped it".
"We found five safety issues in version 1.1.1—some known to Dgraph already—including reads observing transient null values, logical state corruption, and the loss of large windows of acknowledged inserts."
Loss of large windows of acknowledged inserts. Durability is hard.
As staticassertion mentions, the violations that were found were only around tablet moves, which happen only at certain cluster sizes and quite infrequently. Of course, Jepsen triggers those moves left, right, and center to provoke those failure conditions; but that's not how tablet moves behave under real-world conditions. This is different from other edge cases like process crashes, machine failures, network partitions, clock skew, etc., which can and do happen. In those cases, Jepsen didn't find any violations.
We were planning to look into those tablet move issues and get them fixed up (shouldn't be that hard), but honestly, the chances of our users encountering them are so low that we de-prioritized that work in favor of some of the other launches we are doing.
But, we'll fix those up in the next few months, once we have more bandwidth.
I don't really feel like playing the quotes game... but, sure.
"All of the issues we found had to do with tablet migrations"
"ndeed, the work Dgraph has undertaken in the last 18 months has dramatically improved safety. In 1.0.2, Jepsen tests routinely observed safety issues even in healthy clusters. In 1.1.1, tests with healthy clusters, clock skew, process kills, and network partitions all passed. Only tablet moves appeared susceptible to safety problems."
No one here is claiming that anything gets through that kind of rigorous testing with no bugs found. But there is a huge difference between "my extremely common write path + a partition = dropped transactional writes" and "under very specific circumstances, with worst-case testing, multiple partitions, and the db in a specific state, we drop writes".
There is an ocean between, say, MongoDB's test results and Dgraph's.
"If you use Redis as a queue, it can drop enqueued items. However, it can also re-enqueue items which were removed. "
"f you use Redis as a database, be prepared for clients to disagree about the state of the system. Batch operations will still be atomic (I think), but you’ll have no inter-write linearizability, which almost all applications implicitly rely on."
"Because Redis does not have a consensus protocol for writes, it can’t be CP. Because it relies on quorums to promote secondaries, it can’t be AP. What it can be is fast, and that’s an excellent property for a weakly consistent best-effort service, like a cache."
Again, Redis is a very different type of database, so expectations should be aligned. Further, this test is quite old.
But that's a huge difference from DGraph's results.
Basically, saying "Well no one does well on Jepsen" isn't really true. Lots of databases do well, but you have to adjust your definition of "do well".