I've been following the literature on this topic for a while now, and I've always wished for more examples of actual systems built with these techniques. Are there any? What kind of scales do they deal with? Are there any write-ups on them? (I know of at least one good one[1].)
In particular, I've wondered how a simpler solution to the problems posed in this paper stacks up. At what point does it fail? e.g., Consider:
* An op-based model.
* A client-server architecture (not p2p).
* Operations are assigned a total ordering by a central server.
* Given any two dependent operations A and B, A is always transmitted before B.
This still gives you a lot of the stated advantages of an eventually consistent system, where each client communicating with the server will eventually converge even if they all temporarily diverge. The central server and its total ordering are the key ingredients, because they let you guarantee an ordering between any two causally related operations by having the server "choose" one.
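To make that concrete, here's a rough sketch (all names made up, nothing clever) of the kind of server I have in mind -- it just appends incoming operations to a log and hands out sequence numbers:

    # Sketch of a central server assigning a total order to ops.
    import itertools

    class OrderingServer:
        def __init__(self):
            self.log = []                      # (seq, client_id, op), in total order
            self._seq = itertools.count(1)

        def submit(self, client_id, ops):
            # Append a client's batch in the order the client sent it (this is
            # the "dependent ops are transmitted in order" guarantee),
            # assigning each op a global sequence number.
            assigned = []
            for op in ops:
                entry = (next(self._seq), client_id, op)
                self.log.append(entry)
                assigned.append(entry)
            return assigned

        def updates_since(self, seq):
            # Everything a client hasn't seen yet, in the one true order.
            return [e for e in self.log if e[0] > seq]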
I'm naturally interested in tradeoffs. For what use cases does the central server model break down? Is it still useful for other things at smaller scales?
What you describe looks a lot like a Google Wave-style OT system. Wave-style OT is eventually consistent, like CRDTs, but you need a central server to give the event history a total order. This is necessary because Wave-style OT is a 1-1 model: each client is connected 1-1 with the server, but clients are not connected with each other -- that would be n-n, which is what CRDTs can do.
The total order of the central server can make the system simpler and more efficient, but by itself it doesn't solve the problem Wave has to solve: letting a client edit their text/message without being interrupted by network latency/interruptions -- imagine typing a letter and having to wait for the server to acknowledge that keypress at >100ms latencies. To solve this problem, you still need some form of the xform/merge algorithm that OT and CRDT systems provide.
EDIT: I assumed you were not familiar with OT systems since you didn't mention it in your post, but now that I followed your link I can see that you are. In that light, it seems your comment is more a question about what the tradeoffs are between OT and CRDT systems rather than whether a central server can solve all problems without xform/merge logic.
One tradeoff that comes to mind when thinking about OT and CRDT systems is in the way operations track locations in the data structure. In OT systems you have offsets (small); in CRDTs you have UUIDs (large) or dynamically growing identifiers (usually small but possibly large). This has implications for the byte-size of operations and for the in-memory data structure.
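As a toy illustration of that size difference (the shapes below are made up, not any real system's wire format):

    # Hypothetical op shapes, just to illustrate the byte-size point.
    ot_insert = {"pos": 17, "char": "x"}         # OT: a small integer offset

    crdt_insert = {                              # CRDT: identifiers travel with the op
        "id":    ("3f2c...-uuid", 42),           # (replica UUID, counter)
        "after": ("9a01...-uuid", 41),           # identifier of the item it attaches to
        "char":  "x",
    }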
Another is that CRDTs have a pruning problem. It has been some time since I looked at CRDTs, but I remember that Wave-style OT didn't have the same problem, thanks to the central server. The pruning problem can cause a CRDT to grow larger than it needs to by forcing it to keep more historic data around just in case it receives an old operation it hasn't seen yet. The central server solves this by guaranteeing that it will have sent you all old operations before sending you a newer one. If you know all actors in an n-n system you can also solve this issue, but in an unbounded n-n system I didn't see any way to solve it when I was researching this.
EDIT2: Just want to add that there are lots of other problems that are more practical than theoretical. For example, authorization, the authoritative copy of the data, a REST API, things like that, but that would depend more on the exact use case.
With CRDTs, log entries can be purged once they have synchronized with everyone; no need to keep them just in case. Although it's rather implementation-specific.
Thanks, I did address that when I said you can solve it if you know all actors in an n-n system, but I should have been clearer and spelled out the solution, which is (as you already said) having every known actor acknowledge it.
In an unbounded n-n system I still don't see a solution.
Thanks for this! I think I'm still digesting, but I do find the contrast you're drawing between OT and CRDTs interesting. I don't think I had ever seen the fundamental difference between them to be the network topology, but I can't quite claim to understand either in sufficient depth.
Btw, Wave-style OT needs a central server, but there are other forms of OT that do not. They mention this in your link. If your OT system can satisfy the TP2 property you can do n-n synchronization but still have what you'd call an OT system.
Check out Bayou (it's cited in the references of this paper, but not discussed). It's basically what you're describing: operations are initially assigned a temporary timestamp and distributed through gossip, and when they reach a central server, they're given a permanent monotonic sequence number that allows old entries to be garbage-collected.
I think there are few systems built like this because we're living in a decidedly centralized world. Also, CRDTs involve many tradeoffs (limited # of primitive types, complex performance characteristics, difficult algorithms, garbage collection) which make them less a swap-in option than a complete paradigm shift. Still, there are many situations where building on CRDTs would make your software far more resilient. For instance: let's say you want to design a document-based app that a) supports real-time collaboration, b) supports sync to a central server, c) works offline for arbitrary periods of time, d) supports real-time sync between offline and/or local devices. If you're relying on a central server to give you a total order, this would be very hard to do. With a CRDT-based system, it's effectively built-in. Doesn't matter if your peers are local devices, ad-hoc peers, a central server, or even a file on a USB stick: everything will still merge in the end, as long as you can distribute your ops in causal order. (And if you're using a state-based CvRDT, even that doesn't matter. Just send your file and it will merge.)
At the limit, this means not having to think about the consistency of your data in terms of the network.
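To give a feel for why the state-based case is so forgiving, here's a toy grow-only counter (G-Counter), about the simplest CvRDT there is: merge is element-wise max, which is commutative, associative, and idempotent, so it doesn't matter where a copy of the state came from or how many times you merge it.

    # Toy state-based CRDT (G-Counter). Illustrative only; real libraries add
    # more machinery, but merge really is just element-wise max.
    class GCounter:
        def __init__(self):
            self.counts = {}                       # replica_id -> count

        def increment(self, replica_id, n=1):
            self.counts[replica_id] = self.counts.get(replica_id, 0) + n

        def merge(self, other):
            for rid, c in other.counts.items():
                self.counts[rid] = max(self.counts.get(rid, 0), c)

        def value(self):
            return sum(self.counts.values())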
> a) supports real-time collaboration, b) supports sync to a central server, c) works offline for arbitrary periods of time, d) supports real-time sync between offline and/or local devices. If you're relying on a central server to give you a total order, this would be very hard to do.
I guess my question is: why is this hard to do? I have another comment[1] in this thread that gives a bit more of a detailed example.
Others in this thread also seem to be implying that my stated design is similar to how Google Wave worked. From Raph's article that I linked in my initial comment:
> Almost all practical implementations of OT forego TP2, and solve the problem by limiting concurrency in some way, generally requiring a central server to decide on a canonical ordering of the operations (but still using transformations to let clients apply operations out-of-order just a little bit).
Maybe the parenthetical bit about transformations is the part I'm leaving out that I need to make more explicit? Not sure.
If your design is resilient enough to work between arbitrary peers, then you've essentially reinvented OT or a CRDT and your central server is vestigial, no? Basically, if peers are collaborating amongst themselves while disconnected from the central server (as in the case of offline devices working on the same file), they have to figure out how to resolve conflicts between each other anyway. CRDTs simply make this process deterministic.
Also, it should be noted that Google's OT does not actually let you work offline for very long. Hence, "apply operations out-of-order just a little bit".
As for your example:
There are two things that algorithms for eventual consistency have to deal with. First, causality. This is your add/remove vertex example. Obviously, a vertex can't be removed before it's added. Any "remove" has to causally follow the "add". Causality, however, is usually dealt with on the transport layer via causal delivery, i.e. it's not really part of the CRDT. The "hard" CRDT/OT case is a set of concurrent operations without any causal order.
In terms of your example, you picked a data structure which is essentially a pair of sets — an easy case. Consider instead an array. Array operations that are not commutative are not necessarily dependent, such as three peers simultaneously adding an item to the same index. There is no causal "order in which they were applied". CRDTs allow you to tiebreak this case, often by deterministically ordering concurrent operations via their causal timestamp and owner UUID. These bits of info are not really part of the data, but they're still necessary to derive a total order. Thus, they tend to be included as part of each CRDT op, ballooning its size. I believe what the paper proposes is separating out this metadata from the operations, then clearing it out to save space when it's no longer necessary. (That is, when all clients have merged past the spot of contention.)
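A sketch of the kind of tiebreak I mean -- the exact rule varies between CRDTs, this is just one plausible choice:

    # Deterministically order concurrent ops by (Lamport timestamp, replica UUID).
    # One common choice of tiebreak; the point is only that it's deterministic.
    def tiebreak_key(op):
        return (op["lamport_ts"], op["replica_uuid"])

    concurrent_inserts = [                      # three peers insert at index 3
        {"lamport_ts": 7, "replica_uuid": "b...", "index": 3, "char": "B"},
        {"lamport_ts": 7, "replica_uuid": "a...", "index": 3, "char": "A"},
        {"lamport_ts": 7, "replica_uuid": "c...", "index": 3, "char": "C"},
    ]

    # Every replica sorts the same way, so every replica ends up with A, B, C
    # in the same relative order -- without asking a server.
    resolved = sorted(concurrent_inserts, key=tiebreak_key)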
And yes, a central server could perform this tiebreaking step. But in the meantime, the peers talking amongst themselves could be building on their locally-resolved conflicts and find themselves rudely interrupted when the server decides that no, the "truth" they've been assuming for the past minute/hour/day/month is actually the wrong one. (Remember, our goal is arbitrarily long periods of local/P2P/offline communication.)
Interesting, thanks, this does help clarify things for me. I just wanted to post a few follow up clarifications:
1. In my world, the clients don't communicate with each other. Everything has to bounce off the central server first.
2. And yeah, a graph is the easy case. I definitely agree that my design doesn't address sequences. If your application doesn't require sequences, though, then perhaps it's possible to get away with this simpler solution?
1. Yeah, that's why I mentioned "a document-based app that a) supports real-time collaboration, b) supports sync to a central server, c) works offline for arbitrary periods of time, d) supports real-time sync between offline and/or local devices". It's those points c) and d) that benefit most from the CRDT model. Point a) has of course been proven to work well with a central server, but not usually together with c) and d). It's very hard to handle all four cases if you're relying on a single source of truth.
2. Sure. Actually, if your design happens to be commutative for every operation after causal delivery, then you've invented what the paper calls a "pure op-based CRDT"! That means it'll work with a server or even through P2P (again, w/causal delivery) just by sending the operations around — no extra information required. I believe your vertex/edge example qualifies as long as each new object is created with a unique ID. However, if there's even a remote chance that two objects could be created with the same ID, then things fall apart, since you could have three concurrent operations of DEL-A, INS-A (the new one), and DEL-A, assuming a starting set that includes A. Depending on the order, you could end up with the set including or excluding A. This is why there are multiple kinds of CRDT sets and why they're not "pure op-based", e.g. Add-Wins and Remove-Wins as mentioned in the paper. I think you'll find that most data structures that are remotely interesting can't really be simplified down to the "pure op-based CRDT".
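To make the add/remove race concrete, here's a toy add-wins set in the OR-Set style (simplified; the paper's formulation is different): every add gets a unique tag, and a remove only deletes the tags it has actually observed, so a concurrent re-add survives no matter the delivery order.

    import uuid

    # Toy add-wins set (OR-Set flavour), simplified for illustration.
    class AddWinsSet:
        def __init__(self):
            self.tags = {}                              # element -> set of unique tags

        def add(self, elem):
            op = ("add", elem, uuid.uuid4().hex)        # ship op to the other replicas
            self.apply(op)
            return op

        def remove(self, elem):
            # Capture only the tags we've observed; a concurrent add's tag survives.
            op = ("remove", elem, frozenset(self.tags.get(elem, set())))
            self.apply(op)
            return op

        def apply(self, op):                            # same effect in any order
            kind, elem, payload = op
            if kind == "add":
                self.tags.setdefault(elem, set()).add(payload)
            else:
                self.tags[elem] = self.tags.get(elem, set()) - payload

        def value(self):
            return {e for e, tags in self.tags.items() if tags}

With that rule, the DEL-A / INS-A / DEL-A example converges to "A is in the set" on every replica, because neither delete observed the new add's tag.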
PS, I made a mistake in the last comment, at least in terms of intent: the server won't be able to tiebreak that concurrency case since the ops have already been applied locally to all the clients and they don't commute. Instead, the server will be forced to either transform the operations (as in Operational Transformation) and then send them off (thus making them de facto commute), or alternatively panic, rebase (replaying part of its history), and force all clients to resync. That is, assuming the clients don't keep around the operation history themselves; if they do, they could perform the rebase/replay locally and the operations would (technically) commute. And this would even work for arbitrary operation-based data structures, including sequences or anything else. But then you have to make sure that replaying history has reasonable complexity, e.g. not O(N^2), or you'll be waiting hours if a concurrent commit resolves at the beginning of your operation stack. Ugh, this stuff is confusing...
> I believe your vertex/edge example qualifies as long as each new object is created with a unique ID.
Aye, that is indeed the case. (Technically, clients could misbehave/be buggy and insert whatever ID they like, but I think I can live with that. Otherwise, ID generation is handled in a way that guarantees uniqueness.)
> That is, assuming the clients don't keep around the operation history themselves; if they do, they could perform the rebase/replay locally and the operations would (technically) commute. And this would even work for arbitrary operation-based data structures, including sequences or anything else. But then you have to make sure that replaying history has reasonable complexity, e.g. not O(N^2), or you'll be waiting hours if a concurrent commit resolves at the beginning of your operation stack.
I came to this same conclusion as well, and this is exactly what the system does. (Notice that I've slyly gone from "here's this hypothetical design" to "I've actually built something similar to this design." :P) Persistent data structures help a lot here.
I do expect to run into scaling problems with this design, but I think there are many buttons to push (of different kinds) to help with that. In particular, while it is not a document sharing service like Google Docs, it does have a similar property: each operation belongs to one journal, one journal makes up one graph, and there are many graphs in the system editable by users. Regardless, I'm really happy I provoked this discussion because I learned a lot!
Consider a version control system where you had to communicate with the server for every commit, branch, merge, etc. That's how it used to be, and it was painful.
Distributed version control relaxes those constraints with better merging to ensure commits are commutative. Now take that property and extend it to general programming, and you have similar advantages with eventual consistency.
It's harder to see with typical concurrent abstractions, but Microsoft's concurrent revisions explicitly models their concurrency abstractions around distributed version control. Their papers are worth a read if you want ideas, data and inspiration that concurrency and distributed programming can actually be easy.
I think I've done a bad job at communicating my design because it's perfectly consistent with what you've said. My proposed design is a client/server model, but that doesn't mean it isn't distributed. Each client can still operate on its own for as long as it wants, and it chooses to synchronize with the central server whenever it likes. (In fact, this is quite similar to most people's workflows with Git/Github, and I certainly wouldn't call that painful.)
In any case, thanks for the pointers about the Microsoft research. I don't think I've seen that. I'll check it out.
Each client can operate on its own what though? If they have some kind of private copy of the data, then how do you resolve conflicts? CRDTs are about doing this automatically by restricting the types of changes clients can make. Concurrent revisions permit arbitrary changes so long as you provide a merge function to make them commute.
I think you're just hand-waving a lot of the complexity away by using "synchronize" without explaining what is synchronized and how. That's where all the complexity in distributed computation stems from.
Imagine the fundamental representation of your data structure is a sequence of operations. From those operations, you can generate your model. A really simple application of this is a graph, which is just a set of vertices and edges. So if you have these operations
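(roughly, in some made-up syntax)

    VertexAdd(foo)
    VertexAdd(bar)
    VertexAdd(quux)
    EdgeAdd(foo, bar)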
where the ids generated are unique (say, uuids). The corresponding model is just the graph:
Graph {
vertices: {foo, bar, quux},
edges: {{foo, bar}},
}
You might imagine operations to remove vertices and edges as well. Many operations are commutative (add vertex), but some pairs of operations don't commute, e.g., add vertex/remove vertex. As I understand it, that's the central problem this paper is trying to address via CRDTs. What I'm saying is that the total ordering imposed by the server makes this concern moot, because the server always picks one sequential ordering.
Let's say multiple clients are editing our graph structure above at the same time. Let's consider an example of a pair of operations that commute. One of them decides to remove vertex quux while the other adds an edge connecting quux to foo:
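(again, in made-up syntax)

    client 1: VertexDelete(quux)
    client 2: EdgeAdd(quux, foo)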
If both clients attempt to update the server with these new operations, then the server will declare one ordering as true and send the resulting operations back to the clients with the correct ordering. The clients must then adjust. In this case, the outcome is the same regardless of the ordering chosen (because you can't draw an edge to a vertex that doesn't exist).
What about the case when two operations are not commutative? Well, that means they are dependent, which in turn means that every client receives them in the order in which they were applied. Add/delete vertex is one such example, because the client doing the delete must acquire a vertex ID to give to the VertexDelete operation, and the only way to acquire the ID is to observe the VertexAdd first.
Another example of dependent operations is attaching metadata to vertices, e.g.,
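    VertexAdd(foo)
    VertexLabel(foo, "some label")

where the VertexLabel operation only makes sense after the VertexAdd it refers to.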
But if we can assume well behaving clients, then this should be OK! And even if clients are misbehaving, then the process of turning the operations into a model can simply drop operations like VertexLabel because they reference a vertex that doesn't exist.
> If both clients attempt to update the server with these new operations, then the server will declare one ordering as true and send the resulting operations back to the clients with the correct ordering. The clients must then adjust.
But this seems no different than "last write wins". The point is that clients want to keep reliably working on their copy without suddenly losing all of their work because of an incorrect/stale assumption that their changes would be accepted. This is typically considered undesirable.
So with non-commutative changes, either you would sync with the server after each change to ensure you can make meaningful progress, or you risk doing lots of meaningless work and having it all undone.
Yes, but that's natural conflict. One client can do something that the other client doesn't like, and they can go back and forth. The system can't prevent that because it's a natural part of collaboration.
And note that this only applies to the interactions between local operations and server operations. If a client does a bunch of local work and only a small component of it is tied to a previous operation (e.g., one edge in the graph), then that doesn't mean the rest of the work gets dropped. That is, if I create a bunch of new vertices and edges, and only one of those edges connects my new vertex to a vertex already present in the server's operations, and that connection is deleted simultaneously by another client, then I don't lose all of my work---I just lose that one edge. The rest of my graph is intact.
And of course, if this is an end user application (like, say, Google Docs), then the user should be able to view the history and undo what their collaborator did. But I don't see this as any different than editing a document on Google Docs where one collaborator keeps deleting stuff you did.
> Yes, but that's natural conflict. One client can do something that the other client doesn't like, and they can go back and forth. The system can't prevent that because it's a natural part of collaboration.
But this isn't necessarily true. The point is there's more concurrency and collaboration possible than what you describe, if you restrict the operations in specific ways. This is what CRDTs and CRs do.
> But I don't see this as any different than editing a document on Google Docs where one collaborator keeps deleting stuff you did.
It is different. The collaborator actually saw your changes and decided to remove them. This is very different than your changes arbitrarily being lost by the system itself because its merge behaviour is insufficient.
Hmm, OK, I guess I'm kind of lost then. In my mind, the system I described never loses anything. I'm OK nipping this convo because I kind of feel like we're going in circles, but if you do want to continue I'm not sure how because I don't quite understand where we're misunderstanding each other. Maybe show an explicit example of the kind of data loss you're talking about?
My best guess is that I've described my design poorly and I'm leaving something critical out. I think the only way for me to figure out what I've left out is to hear more of your objection. :-)
Not sure if I can respond properly on mobile but I'll do my best.
1. There are differences between system conflicts and user conflicts.
Let's say I'm collaborating on a Google doc. Both my collaborator and I start writing in the same part of the document. If the concurrent system is designed "poorly" (for example, it has last-write-wins semantics), then only one of our edits appears. You can store this history in the system and show that your changes were overwritten by their changes, but this can be frustrating if you are both collaborating on the end of the document.
2. Commutative operations give better guarantees.
Let's take for instance an append only DAG. We restrict this DAG to having 2 operations: creating a vertex (and associated edges) between 2 existing vertices and creating an edge between 2 existing vertices. In this case, even if both my collaborator and I add a vertex between 2 existing vertices, the system will converge on a DAG where both vertices are added. You can follow the same argument with adding edges.
This DAG is an example of a CRDT. It has a limited set of operations (addEdge and addBetween) that allow convergent semantics under concurrent operations.
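A rough sketch of that DAG (names invented; assumes unique vertex ids and that ops arrive after the vertices they reference, i.e. causal delivery):

    # Toy append-only DAG CRDT: only addBetween and addEdge, never any removal,
    # so concurrent ops from different collaborators just union together.
    class GrowOnlyDAG:
        def __init__(self, root="root", sink="sink"):
            self.vertices = {root, sink}
            self.edges = {(root, sink)}

        def add_between(self, new_v, u, w):
            # Create new_v with edges u -> new_v -> w; u and w must already exist.
            assert u in self.vertices and w in self.vertices
            self.vertices.add(new_v)
            self.edges.update({(u, new_v), (new_v, w)})

        def add_edge(self, u, w):
            assert u in self.vertices and w in self.vertices
            self.edges.add((u, w))

    # Caveat glossed over here: the full construction also has to make sure
    # that concurrent add_edge calls can't combine to form a cycle.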
Happy to chat more about this stuff and hope my mobile response makes sense heh.
So w.r.t. (1), yeah, I totally get that. And I agree that is a hard problem to solve in the context of document editing (or more precisely, anything that contains a sequence). But for a graph, the logic is considerably simpler because a graph is just two sets. So if the system "loses" work that the client has done, it's only when some other client has deleted or changed something. So if two clients are editing the same graph and one drops a vertex and the other creates an edge that connects to that vertex, what is the right outcome in that circumstance? It seems to me like the edge gets deleted, and that this is a natural user conflict, no? (Regardless of how you order the operations in this case, the edge gets dropped.)
W.r.t. (2), yeah, sure, but if clients need to be able to remove things, then you lose commutativity and the restricted operation set no longer meets the requirements of the system itself. Or is there some other component of a CRDT system that's supposed to handle this? (My big-picture takeaway from the OP paper was that it was specifically trying to handle the non-commutative operations.)
Yup! Removes are where a lot of these things get tricky. The typical strategy is as you say, removes take precedence over adds, so if one client adds vertices/edges onto a vertex that gets removed by another client, when the DAGs converge, the remove gets precedence and "drops" all edges/vertices attached. It seems like we're basically saying the same thing.
Removes are also tricky because if you're just transmitting the remove operation itself, you have to ensure that updates appear to the client in order, or else removes and adds could conflict, causing inconsistent state.
Yeah, in practice, each client maintains a persistent data structure, where each "mutation" is the application of a single operation. If a new operation comes in that occurs before other operations that have already been applied to the persistent data structure, then the client rewinds the structure to its state just before the new out-of-order operation, applies that operation, and then applies the rest of the operations that it already has. (I think of this as very similar to a `git rebase`.)
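Roughly, the client-side logic is something like this (a simplified sketch of the general idea, not the actual code; `apply` stands in for however an operation mutates the model):

    # Simplified sketch of the rewind-and-replay step described above.
    # snapshots[i] is the (persistent) state after applying ops[:i],
    # so snapshots[0] is the initial state.
    def integrate(ops, snapshots, new_op, position, apply):
        ops.insert(position, new_op)          # slot the late op into its place
        del snapshots[position + 1:]          # rewind: later states are now stale
        state = snapshots[position]
        for op in ops[position:]:             # replay, like a local `git rebase`
            state = apply(state, op)          # persistent structures keep this cheap
            snapshots.append(state)
        return state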
But yeah, this might not work at all for sequences. And of course, there are other issues with respect to the memory requirements of the persistent data structure in the client. Some sort of compaction is probably necessary.
A central server for ordering would require you to always wait for its decision on every operation, which adds latency, limits throughput, and reduces availability, not to mention how much complexity it adds compared to something as simple as vector clocks.
I don't see how you need to wait. It's still an eventually consistent system. Any client can build up their own replica of operations and synchronize with the central server at any time.
I'm not sure I see the additional complexity. With a central server, you get to enforce a total ordering automatically across all operations.
In any case, what I'm looking for is actual data. I implicitly understand that a central server won't scale as well, but that's not particularly interesting to me. For example, what if my requirements on throughput are very low? I want to know the breaking point. So I guess I'm looking for an experience report from someone who has actually gone through this process.
Operations require some sort of versions to synchronize, and those versions are also where ordering comes from. This matters because networks are asynchronous, so messages can arrive at this central server in a different order. So the way to achieve ordering with a central server while still being able to synchronize operations is to ask that server and wait for its decision. Not really that simple, nor even that usable. The simplest and most naive alternative is to use timestamps as versions, still without a central server. But wall-clock time can't really be synchronized, so logical clocks should be used instead, like Lamport timestamps or vector clocks.
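For reference, a Lamport clock is about this much code (simplified); the (counter, node id) pairs give every replica the same total order, consistent with causality, without asking anyone:

    # Minimal Lamport clock: versions are (counter, node_id) pairs, no server needed.
    class LamportClock:
        def __init__(self, node_id):
            self.node_id = node_id
            self.counter = 0

        def tick(self):
            # Call before generating a local operation; tag the op with the result.
            self.counter += 1
            return (self.counter, self.node_id)

        def observe(self, remote_timestamp):
            # Call when receiving a remote op, so local time moves past it.
            self.counter = max(self.counter, remote_timestamp[0]) + 1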
If you have a single logical server, then it chooses the ordering, but a single client doesn't need to ask the server for the ordering of any operations generated locally: it can pick the order because it determines the relative ordering of those operations.
A client could be disconnected from the network for hours and re-synchronize at some later point in time and it all would work just fine.
The only time the client needs to connect to the server are when the client wants to publish updates, or when the client wants to receive the updates. If you only need to do that once every 10 seconds (or even once every second), does that make a client/server model workable?
Ok, let's think: if a single client only has local ordering and the server assigns the global ordering, then a retransmit (in case the client didn't receive an acknowledgement from the server) is going to be assigned a global ordering again, as if it were a new batch of data. Which breaks consistency. If we want to keep consistency and still be able to synchronize, we need to have this global ordering on the client.
To make it work that way, a client has to ask the server for a starting point before it can generate operations, and increment it with each new operation. But then it doesn't sound like the server does the ordering, and it starts to look a lot like Lamport timestamps.
Maybe think about it this way: a client has two sets of operations. One set reflects what it has received from the server, and the other set reflects the operations that the client itself has generated but has not yet received confirmation from the server that they have been assimilated into the global ordering (or perhaps the client simply hasn't sent them yet). We also have a guarantee that if a client sends a batch of operations, then the server will not reorder those operations relative to one another.
For the client, the only real choice they have (until they hear otherwise) is to assume that all local operations come after all server operations. At some point, the server will send the client updates with new operations from other clients, or perhaps even confirmation that some of their previously local operations have now been recorded by the server and assigned a total ordering. The client is then responsible for updating its local state: moving the local operations into the server operation list and updating the server operation list to contain any additional operations reported by the server. The client must then re-materialize any data structures dependent on these operations, but depending on the nature of your operations, this might be possible to do efficiently.
Given enough time where every client is idle, its set of local operations will become empty as they all move into the server's set of operations. All clients then have the same set of operations in the same order, which gives you consistency.
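In code, the client state I'm imagining is roughly this (names invented; `build_model` stands in for however you fold operations into your data structure):

    def build_model(ops):
        # Stand-in: fold the ops into whatever structure you need (e.g. the graph).
        ...

    # Sketch of the client described above: confirmed server ops plus a queue
    # of local ops the server hasn't acknowledged yet.
    class Client:
        def __init__(self):
            self.server_ops = []     # ops in the server's total order
            self.pending = []        # local ops, assumed to come after server_ops
            self.model = build_model([])

        def local_edit(self, op):
            self.pending.append(op)
            self._rematerialize()

        def on_server_update(self, entries):
            # entries: new (seq, client_id, op) tuples, in the server's order,
            # possibly including acknowledgements of our own pending ops.
            for seq, client_id, op in entries:
                self.server_ops.append(op)
                if op in self.pending:           # our op got its final position
                    self.pending.remove(op)
            self._rematerialize()

        def _rematerialize(self):
            self.model = build_model(self.server_ops + self.pending)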
Well, communications are still unreliable and you really have to use proper distributed algorithms to solve that if you want to keep consistency, there is no way around it.
It seems like a pretty simple solution works in this particular design space: the client keeps asking for updates and keeps pushing updates to the central server.
I think we are talking past each other. It is simple to synchronize things, very simple, but only without centralized coordination. In distributed systems, any coordination complicates things a lot, and I don't think it's even possible to make a simpler solution with coordination if there exists one without it.
Google Wave is an example of what works well with a central server: you have many documents that can be sharded across many instances because each document only needs to be consistent with itself, and your throughput limit is therefore only per-document, which is more likely to be bounded by the number of participants that can practically interact in a single document.
There is still the issue of a node becoming hot because there is an unusually active document on that node, which would usually not happen in a completely decentralized model.
Normally we talk about these things in terms of CAP: consistency, availability, and partition tolerance.
Systems with a central server avoid problems of partition tolerance and consistency just fine. It's the availability that's a problem. The system cannot operate without the central server. Whether this is a problem in your use case depends on your use case. Nationwide financial system? Aerospace control system? You probably care a lot about availability and can't use the central server model.
(Although ironically a large part of how stock exchange clearing works is imposing an order on transactions and reconciling the result at the end of the day)
Yeah that is a good take! Thanks. Do you have any links to (relatively?) accessible reading on how stock exchange clearing works? That sounds like an interesting application of exactly this sort of problem that I'd like to learn more about.
"Clearing" in general is a pre-digital technique for resolving concurrency. Someone writes a cheque for payment and puts a date on it, then exchanges it for goods; at some point later, it reaches the single, central ledger of that customer's bank account which is updated to reflect the transaction.
I'll also recommend https://martinfowler.com/articles/lmax.html , because it's an interesting architecture with a similar approach of just routing all the operations through in series, but with a number of producers/consumers attached to a ring buffer.
You need a server, which is a single point of failure. If you want redundancy, you now need a quorum in the event of a split brain. CRDTs have the advantage of not needing a quorum in the event of a split brain. There is no way to 'lose' information. In addition to being resilient to a network split brain, this allows you to make performance gains at the sacrifice of consistency, where you could buffer your updates to other nodes.
Operation based CRDTs do seem more limiting compared to State Based though, so you're at the mercy of your data structure being able to be represented as a CRDT.
Sure, but let's say I was OK with a single point of failure. Or let's be more specific and say my central server is a PostgreSQL instance.
> In addition to being resilient to network split brain, this allows you to make performance gains at the sacrifice of consistency, where you could buffer your updates to other nodes.
> Sure, but let's say I was OK with a single point of failure.
I cannot imagine designing a system for production use where anyone is OK with a single point of failure being a single instance of a machine. I mean if we're playing hypothetical games you don't need a distributed system either, just get one giant instance and put everything on it.
I'm trying to distill the design down to its essential components, and then figure out when it will break. If you're going to stick on this point, then let's imagine I have a PostgreSQL setup with replication, such that if any one fails, I can fall back to another. I primarily want to put this in the context of what problems folks are solving with operational CRDTs to get a sense of the scale that requires them.
> I mean if we're playing hypothetical games you don't need a distributed system either, just get one giant instance and put everything on it.
I guess the requirements need to be more explicitly stated. You need to at least be able to support multiple simultaneous clients interacting with the system, where any client could disconnect for some period of time.
Habitat uses some simple data structures that are basically CRDTs. We used them because they made the user experience better (not needing infrastructure other than the gossip itself). One thing I learned early on that journey was that the more specific your use case, the easier a fully distributed model is. As soon as you need generic primitives a-la etcd or zookeeper, everything gets orders of magnitude harder.
What happens in your scenario if operation B arrives hours later? It's hard to provide reliable ordering in practice because networks aren't dependable and things that happened earlier may show up later.
I work on a realtime database and we heavily rely on CRDTs for communication. We can't have a central server because our goals are fault tolerance and high throughput.
If you're more curious about the subject look into Riak. Afaik that is the biggest project built with CRDTs
> What happens in your scenario if operation B arrives hours later? It's hard to provide reliable ordering in practice because networks aren't dependable and things that happened earlier may show up later.
I guess I'm trying to figure out why this is a problem. If operation B shows up hours later, then the client simply doesn't know about B until then. But since my stated design requires that dependent operations are delivered in order, we have that any other operations are necessarily concurrent with B, so it shouldn't matter when B is "evaluated."
> I work on a realtime database and we heavily rely on CRDTs for communication. We can't have a central server because our goals are fault tolerance and high throughput.
> If you're more curious about the subject look into Riak. Afaik that is the biggest project built with CRDTs
Yeah, I'm definitely aware of Riak. I guess I'm more interested in the systems layered on top of these components. i.e., What problems are they trying to solve?
For instance, what if you could relax your requirements? Maybe: throughput/latency only need to be good enough for web response time to end users.
For the most part CRDTs of all types solve primarily one problem. The "lost update" problem.
When I worked with CRDTs---even inventing a new one based on R*-trees---they were most useful for objects/structures that needed constant mutating by lots and lots of participants concurrently where it made sense to trade constant coordination/protection overhead for lazy-evaluation/merging. Like in a settlement ledger, or massive leaderboard, or indexing structure, etc.
@thdxr, which realtime database do you work on? I know some people at Riak/Basho who are now at MIT, they are helping us formalize our CRDT ( https://github.com/amark/gun/wiki/Conflict-Resolution-with-G... ). We were chatting just last night and he said that Riak actually lacked proper CRDT in many settings (for example, secondary indices would produce non-deterministic results depending on order).
We use it for pretty basic purposes -- just replicating ephemeral data across our cluster, for debug settings, flash messages, etc. If the whole cluster goes down, no big deal.
I'm not an expert or an academic, but I've been spending a lot of time with CRDTs lately. It seems to me that there are two major issues with this approach.
First, little is said about performance. As the paper explains, the meat of each CRDT is pushed into the eval function, which to my understanding is simply a function over the PO-Log. However, is it always possible to adapt a convergent data structure in such a way that eval takes a reasonable amount of time? I notice that sequences — perhaps the most important data type in CRDT-land! — have not been implemented using this approach. If we assume that a sequence can be retrieved from an insert/delete PO-Log by simply sorting it and removing the deleted operations, does that mean that every eval is O(N log N) at best? And if your solution is to cache the output sequence as an array, a) how do you ensure a correct mapping between the PO-Log and the cache on every new operation, and b) what happens if you lose your data and need to replay your PO-Log from scratch? Can the cache be reconstructed in O(N log N) at worst? A complete guess, but maybe having CRDT-specific bits in the prepare/effect steps is what actually allows CRDTs to be performant in the first place! PO-Log representation is alluringly flexible but seems to come with some hefty tradeoffs.
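For concreteness, here's the kind of eval I'm picturing for a sequence over a PO-Log -- my reading of the approach, not code from the paper -- where you pay the sort on every call:

    # My mental model of eval() for a sequence over an insert/delete PO-Log.
    def eval_sequence(po_log):
        deleted = {op["target"] for op in po_log if op["kind"] == "delete"}
        inserts = [op for op in po_log
                   if op["kind"] == "insert" and op["id"] not in deleted]
        inserts.sort(key=lambda op: op["position_id"])   # O(N log N) on every eval
        return "".join(op["char"] for op in inserts)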
Second, one of the cleanup steps relies on causal stability, i.e. knowing that each client is ahead of a potentially concurrent op. This is a problem in pure, decentralized P2P environments. First, depending on your network architecture, it's not necessarily possible to identify each peer until they actually start sending messages around. Maybe they got their hands on an early revision of the data and have been chipping away for weeks before going online. Second, nothing prevents a peer from connecting for a bit and then leaving forever, thus ensuring that their last edits will never be causally stable. This can be solved with some centralized logic, but then what's the point of using a CRDT at all?
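Concretely, causal stability boils down to a check like this (a sketch), which is exactly what you can't evaluate if you can't enumerate the peers:

    # An op is causally stable once every known replica's version vector has
    # advanced past it. With unknown/unbounded peers, "every known replica"
    # is not a well-defined set, so the check can never safely succeed.
    def causally_stable(op_timestamp, peer_version_vectors):
        origin, counter = op_timestamp                  # (replica id, its counter)
        return all(vv.get(origin, 0) >= counter
                   for vv in peer_version_vectors.values())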
Finally, and less critically, the lossy cleanup steps make it impossible to retrieve old revisions or identify the author of a particular change.
[1] - https://medium.com/@raphlinus/towards-a-unified-theory-of-op...