Under normal circumstances, Redis doesn't lose writes. If it did, we'd see it in our metrics for cache misses or stale data, and our vector clock state machines wouldn't work. We have very fine-grained monitoring for all of these failure classes.
Operationally, we sequence Redis downtime events. They're very rare, and they're when most people would be concerned about losing data.
We shift traffic so the downtime is hitless. Our model is eventually consistent across multiple regions, and we won't accept writes in a cluster that is going down.
The Redis replication chain was chosen so that the primary Redis never has to fork (for an AOF rewrite or bgsave) or even do disk IO (AOF writes in general). We let our offline replicas do that work, so the primary can focus on serving reads and writes.
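To make the split concrete, here's a rough sketch with redis-py and made-up hostnames (our real setup is more involved): persistence stays off on the primary, and a downstream replica carries the AOF and RDB work.

```python
import redis

# Made-up hostnames; the point is the persistence split, not the topology.
primary = redis.Redis(host="redis-primary.internal", port=6379)
persist_replica = redis.Redis(host="redis-replica-1.internal", port=6379)

# Primary: no RDB snapshots (no bgsave fork) and no AOF (no disk IO on the
# write path). It only serves reads/writes and streams the replication link.
primary.config_set("save", "")
primary.config_set("appendonly", "no")

# Offline replica: follows the primary and does the disk work instead.
persist_replica.slaveof("redis-primary.internal", 6379)
persist_replica.config_set("appendonly", "yes")
persist_replica.config_set("appendfsync", "everysec")
```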
We alert on replication chain breakdown, and we sequence traffic draining and primary/secondary swap operations for things like OS or Redis upgrades.
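A check like that can be as simple as watching the primary's replication info; a simplified sketch of that kind of check (assumed, not our actual tooling):

```python
import redis

def replication_chain_healthy(primary_host: str, max_lag_bytes: int = 1_000_000) -> bool:
    """Rough check: every attached replica should be online and close to the
    primary's replication offset; otherwise the chain needs attention."""
    info = redis.Redis(host=primary_host, port=6379).info("replication")
    if info["connected_slaves"] == 0:
        return False  # chain is broken: nothing downstream doing persistence
    for i in range(info["connected_slaves"]):
        replica = info[f"slave{i}"]  # redis-py parses this line into a dict
        behind = info["master_repl_offset"] - replica["offset"]
        if replica["state"] != "online" or behind > max_lag_bytes:
            return False
    return True
```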
It's pretty sophisticated and largely automated.
This model tolerates the loss of Redis instances. The only writes at risk are those the primary accepted in the short window before they replicated, if it dies before being failed out, and that number would be incredibly small. We would tolerate such losses.
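For anyone who couldn't tolerate even that window, Redis does expose a knob for it: WAIT blocks the client until N replicas have acknowledged its writes. We don't need it, but as an illustrative sketch:

```python
import redis

r = redis.Redis(host="redis-primary.internal", port=6379)

r.set("user:123:session", "some-session-token")  # illustrative key/value
# Block until at least 1 replica has acknowledged the replication stream up to
# this point, or 100 ms pass. Whatever the primary accepted that no replica has
# seen yet is exactly the data at risk if the primary dies right now.
acked = r.wait(1, 100)
if acked < 1:
    # Handle the unreplicated write however the workload requires.
    ...
```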
We've got a lot of nines of reliability, and this pattern has scaled and served us well.
> We've never encountered data loss issues with our read and write heavy Redis services
> We would tolerate such losses
In the context of the original concern, these are conflicting statements.
The underlying argument here is that you shouldn't use Redis for anything you can't tolerate losing. Your use case and architecture are great, and I'm glad it works well for you. But at the end of the day, there are workloads that can't tolerate such losses, and for those cases Redis is not a good fit.