Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

One that is not mentioned here but that I like as a general principle is that you cannot have exactly once delivery. At most once or at least once are both possible, but you have to pick your failure poison and architect for it.


*Between two distributed systems that don't share the same transactionality domain or are not logically monotone.

It's easy to see that moving data from one row to another is doable in a clustered database, and could be interpreted as a message being delivered.

The point is that you can get exactly once delivery, if your whole system is either idempotent, or you can treat the distributed system as one single unit that can be rolled back together (i.e. side effect free wrt. some other system outside the domain).

Both are cases of some form of logical monotonicity, idempotence is easier to see, but transactionality is also based on monotonicity through the used WALs and algorithms like Raft.

The article should really mention CALM (Consistency as logical monotnicity), it's much easier to understand and a more fundamental result than CAP. https://arxiv.org/pdf/1901.01930


> The point is that you can get exactly once delivery, if your whole system is either idempotent

If you have exactly-once delivery, there's no difference between being idempotent or not. The only effect of idempotence is to ignore extra deliveries.

If idempotence is a requirement, you don't have exactly-once delivery.


Idempotence allows you to build systems that in their internal world model allow for exactly once delivery of non-idempotent messages over an at least once medium.

Exactly-once delivery semantics doesn't equal exactly-once physical messages.


Can’t stress this enough. Thank you for mentioning it!! In my career I’ve encountered many engineers unfamiliar with this concept when designing a distributed system.


> I’ve encountered many engineers unfamiliar

tbf, Distributed Systems aren't exactly easy nor common, and each of us either learn from others, or learn it the hard way.

For example, and I don't mean to put anyone on the spot, Cloudflare blogged they'd hit an impossibly novel Byzantine failure, when it turned out to be something that was "common knowledge": https://blog.cloudflare.com/a-byzantine-failure-in-the-real-... / https://archive.md/LK8FI


Yeah it’s very common since it becomes an issue as soon as you get sockets involved in an app. A lot of frontend engineers unknowingly end up in distributed system land like this. Coordinating clients can be just as much a distributed systems challenge as coordinating servers—often it’s harder.


It's impossible to have at least once delivery in an environment with an arbitrary level of network failure.


You can have at-least-once-or-fail though. Just send a message and require an ack. If the ack doesn’t come after a timeout, try n more times, then report an error, crash, or whatever.


Sure you can: keep sending one message per second until you receive a response from the recipient saying "OKAY I HEARD YOU NOW SHUT UP!"


The important part of this lesson is "and you don't need it".


Apache Flink does provide end-to-end exactly-once guarantees when coupled with data sources and data sinks that participate in its checkpointing mechanism. See:

- An Overview of End-to-End Exactly-Once Processing in Apache Flink (with Apache Kafka, too!)https://flink.apache.org/2018/02/28/an-overview-of-end-to-en...

- Flink's Fault Tolerance Guaranteeshttps://nightlies.apache.org/flink/flink-docs-release-1.20/d...


Exactly once processing != exactly once delivery


And on top of that this _end-to-end exactly-once guarantees_ makes the result look like the message was processed in the pipeline only once, while in reality a message will be processed 2x (or more times) in the pipeline in case of a temporary failure but only one version will be committed to the external system.


But it’s mostly the processing that is interesting. Like this message I write now might see several retransmissions until you read it, but it’s a single message.


Yes but in practice this is not a problem because the bits that are impossible are so narrow that turning at-least-once into exactly-once is so easy it's a service offered by cloud vendors https://cloud.google.com/pubsub/docs/exactly-once-delivery


Those only work because they have retries built in at a layer that runs inside your service. You should understand this because it can have implications for the performance of your system during failures.


For example, you can have an exactly once implementation that is essentially implemented as at least once with repeated idempotent calls until a confirmation is received. Idempotency handling has a cost, confirmation reply has a cost, retry on call that didn’t have confirmation has a cost, etc.


My experience is building streaming systems using “exactly-once delivery” primitives is much more awkward than designing your system around at least once primitives and explicitly de-duplicating using last-write wins. For one thing, LWW gives you an obvious recovery strategy if you have outages of the primitive your system is built on: a lot of the exactly once modes for tools make failure recovery harder than necessary


"Redelivery versus duplicate" is doing quite a lot of work in there. This is an "at least once" delivery system providing building blocks that you can use to cope with the fact that it's physically impossible to prevent redelivery under some circumstances, which are not actually that rare because some of those circumstances are your fault, not Google's, etc.


It's often not a problem because it's often easy to make a call idempotent. Consider the case where you attach an ID to every event and the subscriber stores data in postgres. You stick everything in a transaction, persist each event's ID, add a unique index to that column, handle failure, and bang, it's now idempotent.


And if that service called an external system but failed before committing the transaction? I’m not sure you should be using db transactions in distributed systems as you can’t recover from partial failures.


It depends on what you're doing. On some projects we've put the job queue into our database, which works great at low and medium traffic volumes and means we can do exactly the same thing for external systems if they let us send them an ID.

I agree that the first port of call is to make your operation legitimately idempotent without passing IDs around, and the second is to ask if it is really important enough to care about delivery, but if you're not operating at ridiculous scale and you're okay with having a single point of failure then you get to avoid the "distributed systems are hard" rule by not actually having a completely distributed system.


at-least-once/exactly-once/at-most-once delivery are all weirdly named from the wrong perspective. From the sender's perspective there are only two options: send once and send lots. Behold:

- you send a message

- you receive nothing back

- what now

There is no algorithm that lets you implement exactly-once delivery in the face of delivery instability. Either you don't resend and you implemented at-most-once, or you resend and you implemented at-least-once.

You might say, "but hey, the receiver of course sends acks or checkpoints; I'm not a total buffoon". Sure. Let's game that out:

- you send message 8

- you get an ack for message 7

- you receive no more acks

- what now

Every system you'll use that says it implements exactly-once implements send lots and has some mechanism to coalesce (i.e. make idempotent) duplicate messages.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: