Oh, it took us around 2 years to get somewhat reliable event dispatching with transactional outboxes. We've run into so many edge cases under high load:
- using the autoincrement ID as a pointer to the "last processed event", not knowing that MySQL reserves auto-increment IDs non-atomically with respect to the insert's commit, so the relay can "see" an event with a bigger ID before an event with a smaller ID becomes visible and skip events as a result (see the sketch at the end of this comment)
- implementing event handlers with "only once" semantics instead of "at least once", so retries on error corrupted data
- event queue processing halting for the entire application when an event handler threw an error due to a bug and got retried infinitely
- other race conditions in the implementation that messed up the order of events
- events that were too fine-grained, with no batching, overflowing the queue (it took too long for an event to get processed)
- the relay getting OOMs due to excessive batching
- once we had a funny bug where code that updated the last processed ID of the current DB shard (each client has their own shard) wrote to the wrong shards, so our relay started replaying thousands of events from years ago
- some event handlers always sending mail as part of processing events, so when one is retried on error or replayed (see the bug above), clients receive the same emails multiple times
And we still hit weird bugs sometimes: about once a month we see a random event getting replayed in a deleted account, for reasons we're still tracking down.
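To make the first gotcha concrete, here's a minimal sketch of the two relay queries (the outbox table, column names, and bind parameters are illustrative, not our real schema):

    -- Hypothetical MySQL outbox table; names are for illustration only.
    CREATE TABLE outbox (
      id         BIGINT AUTO_INCREMENT PRIMARY KEY,
      payload    JSON NOT NULL,
      published  TINYINT(1) NOT NULL DEFAULT 0,
      created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Fragile relay: auto-increment IDs are handed out before commit, so a row
    -- with id = 42 can become visible *after* a row with id = 43. If the relay
    -- has already advanced its watermark past 42, that event is skipped forever.
    SELECT * FROM outbox WHERE id > :last_processed_id ORDER BY id LIMIT 100;

    -- Safer relay: select by flag instead of by watermark, so late-committing
    -- rows are still picked up on the next poll.
    SELECT * FROM outbox WHERE published = 0 ORDER BY id LIMIT 100;
    UPDATE outbox SET published = 1 WHERE id IN (:relayed_ids);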
On MS SQL you have a similar scenario IIRC - you can use READ COMMITTED to block the query when there are uncommitted writes in the range of rows you're reading, but in general you're encouraged to use READ COMMITTED SNAPSHOT, which won't block...
And then you get the problem where a different thread is inserting a message into the queue table, but the insert is taking a while, and the dispatcher reads the queue table and misses the uncommitted message... and never sees it again, because you're using a "last processed offset" as your resume point and you've moved past that point by the time the concurrent insert commits. That's why I use "IsUnpublished = 1" as my flag, which also makes replay very simple to do (just create a filtered index on that column for performance reasons).
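A minimal T-SQL sketch of that setup (table, column, and variable names here are illustrative):

    -- Illustrative outbox table using an IsUnpublished flag.
    CREATE TABLE dbo.OutboxMessage (
        MessageId     BIGINT IDENTITY PRIMARY KEY,
        Payload       NVARCHAR(MAX) NOT NULL,
        IsUnpublished BIT NOT NULL DEFAULT 1,
        CreatedAtUtc  DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- Filtered index keeps the "to be dispatched" scan cheap as the table grows.
    CREATE NONCLUSTERED INDEX IX_OutboxMessage_Unpublished
        ON dbo.OutboxMessage (MessageId)
        WHERE IsUnpublished = 1;

    -- Dispatcher resumes by flag, not by offset, so a slow concurrent insert
    -- is simply picked up on a later poll instead of being skipped.
    SELECT TOP (100) MessageId, Payload
    FROM dbo.OutboxMessage
    WHERE IsUnpublished = 1
    ORDER BY MessageId;

    -- After a successful publish:
    UPDATE dbo.OutboxMessage SET IsUnpublished = 0 WHERE MessageId = @MessageId;

    -- Replay is just flipping the flag back for the affected window:
    UPDATE dbo.OutboxMessage SET IsUnpublished = 1 WHERE CreatedAtUtc >= @ReplayFromUtc;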
When I first started building applications like this, I eventually learned that the people discussing these patterns placed very little explicit emphasis on distinguishing between "application events" and "integration events" -- if you are struggling with too many events, maybe this is where your issue is.
A final point - your business logic event handler shouldn't send mail. You should produce an "operation completed (ID)" event and send the email off the back of that in an email notification event handler.
This is interesting, but you've not actually solved the problem, just moved it. You still need cross-service transactions to publish the event only-once. Consider the case of "Publish the event to the queue. Fail to update / delete the entry in the event-buffer table." This is the "bad" pattern of push-then-store (with one exception - if you are fine with at-least-once message delivery instead of only-once). Likewise the "good" pattern of store-then-push has the same failure mode "Delete the entry from the buffer. Fail to publish the entry to the queue".
That said, this does decouple the two operations, which lets you scale the publish side of the service separately from the produce side (helpful when your architecture can produce multiple messages per storage event).
Maybe you are using different terminology, because IMO you don't need "cross-service transactions to publish the event only-once".
"Exactly-once" message processing is what you typically want in business operations, and that is straightforward to achieve with "at-least-once" message delivery:
1. send(message with unique .id):
a. insert into DB queue where not exists message.id
2. dispatcher:
a. fetch messages from DB queue where unpublished = 1
b. publish message to integration queue [at-least-once-delivery]
c. update message set unpublished = 0 (and ideally record some audit information e.g. the sent timestamp + Kafka message partition/offset or IBM MQ message ID)
3. receive(message):
a. check if message.id has been processed [exactly-once-processing]
b. perform operation, including linking it back to message.id, in a single DB transaction
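A rough Postgres-flavoured SQL sketch of steps 1 and 2 (table, column, and parameter names are made up for illustration; step 2.b is whatever publish call your broker client provides):

    -- Step 1: send(message) inside the business transaction. The NOT EXISTS
    -- guard (or equivalently a unique constraint on message_id) makes the
    -- enqueue idempotent, so re-running the operation cannot insert a duplicate.
    INSERT INTO message_queue (message_id, payload, unpublished)
    SELECT :message_id, :payload, 1
    WHERE NOT EXISTS (
        SELECT 1 FROM message_queue WHERE message_id = :message_id
    );

    -- Step 2a: the dispatcher fetches unpublished messages.
    SELECT message_id, payload
    FROM message_queue
    WHERE unpublished = 1
    ORDER BY message_id
    LIMIT 100;

    -- Step 2b happens outside the database: publish each fetched message to the
    -- integration queue (Kafka, IBM MQ, ...) with at-least-once delivery.

    -- Step 2c: mark as published and keep an audit trail of where it went.
    UPDATE message_queue
    SET unpublished = 0,
        sent_at     = CURRENT_TIMESTAMP,
        broker_ref  = :partition_offset_or_mq_message_id
    WHERE message_id = :message_id;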
If failure occurs:
1. between 2.a and 2.b, you fetch messages again and nothing is lost
2. between 2.b and 2.c, you republish the same message; 3.a ensures exactly-once *processing*
3. between 3.a and 3.b, you retry the message (with IBM MQ, you use transactions or peek then read; with Kafka, you don't update offsets until 3.b completes)
4. on the integration queue itself: set unpublished = 1 on all messages published in the timeframe where the integration queue lost data
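For step 3, a rough sketch of the receive side (again, the processed_messages table and the accounts update are stand-ins, not anything from the article): the point is that recording message.id and performing the operation share one local transaction, so a redelivered message fails the id insert and can be acknowledged without re-applying the work.

    -- Step 3: receive(message), all in one transaction on the consumer's DB.
    BEGIN;

    -- 3a: claim the message id. processed_messages(message_id) has a unique
    -- constraint, so a redelivered message fails right here.
    INSERT INTO processed_messages (message_id, processed_at)
    VALUES (:message_id, CURRENT_TIMESTAMP);

    -- 3b: perform the operation, linked back to message.id for traceability.
    -- (accounts/balance is just a placeholder for whatever the operation is.)
    UPDATE accounts
    SET balance = balance + :amount,
        last_message_id = :message_id
    WHERE account_id = :account_id;

    COMMIT;

    -- If 3a hits the unique constraint, roll back and acknowledge the message:
    -- it was already processed, so dropping the redelivery is safe.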
In practice: SQL DBs will surprise you with lock and concurrency semantics; I suggest using something battle-tested under high load, or an event store database.
In a sense, this is a "cross-service transaction", but it is not a "transaction protocol", rather it is an eventually consistent design.
> You still need cross-service transactions to publish the event only-once. Consider the case of "Publish the event to the queue. Fail to update / delete the entry in the event-buffer table."
I'm not sure this is the problem the author has set out to solve. Exactly-once dispatching is really hard and requires much more than what's written about here. Even though the system the author wrote about is at-least-once, it guarantees that an event isn't dispatched for data that's not stored, and that data isn't stored without its event eventually being dispatched.
On the topic, I created this Ruby gem called tobox, essentially a transactional outbox framework: https://gitlab.com/os85/tobox
It actually circumvents most of the limitations mentioned in the article. I've been successfully using it at work as an SNS relay for another app which isn't even written in Ruby.
Any thoughts on how to detect direct DML on the state table? Presumably allowing direct DML on the state table without the same on the Outbox table would lead to silent data corruption or lost updates.
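One thing I've been wondering about is whether a guard trigger could at least catch it: reject direct writes unless the session explicitly flags that it's going through the outbox-aware path. A rough MySQL-flavoured sketch (all names hypothetical, and you'd need matching INSERT/DELETE triggers too):

    -- Hypothetical guard: direct DML on the state table is rejected unless the
    -- session has declared that it is writing via the outbox-aware code path.
    DELIMITER //
    CREATE TRIGGER orders_require_outbox
    BEFORE UPDATE ON orders
    FOR EACH ROW
    BEGIN
      IF @writing_via_outbox IS NULL THEN
        SIGNAL SQLSTATE '45000'
          SET MESSAGE_TEXT = 'Direct DML on orders: use the outbox-aware path';
      END IF;
    END//
    DELIMITER ;

    -- The outbox-aware code sets the flag for its own session before writing:
    -- SET @writing_via_outbox = 1;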
When I hit commercial software development in the late 00s, at least in the Microsoft space, distributed transactions were considered an anti-pattern because the complexity of it all >> the benefit. "Application-level" distributed transactions (i.e. asynchronous, eventually consistent, business-transaction messages with exactly-once processing semantics) are best practice at all the places I've worked in the past 15 years.