
I honestly don't understand why this isn't a Postgres extension. In what case is it better to have two databases?

Realistically, I can't see a scenario where you need something like this but are at the same time sure that you won't need operations that are atomic across both the database and the financial ledger.




> We saw that there were greater gains to be had than settling for a Postgres extension or stored procedures.

> Today, as we announce our Series A of $24 million

Can't raise a 24 million series A for a Postgres extension!


That's true, but who are the executives that buy this and then force developers to create a monstrous architecture that has all sorts of race conditions outside of the ledger?


To be clear, TB moves the code to the data, rather than the data to the code, and precisely so that you don't have "race conditions outside the ledger".

Instead, all kinds of complicated debit/credit contracts (up to 8k financial transactions at a time, linked together atomically) can be expressed in a single request to the database, composed in terms of a rich set of debit/credit primitives (e.g. two-phase debit/credit with rollback after a timeout), to enforce financial consistency directly in the database.
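
To make that concrete, here is a rough sketch in Python of a two-phase debit/credit with a timeout, followed by a post atomically linked to a fee transfer. The create_transfers stub stands in for a real client binding, and the transfer objects are simplified; the flag and field names follow TigerBeetle's documented transfer model, but everything else is illustrative only.

    def create_transfers(batch):
        """Stub standing in for a real client's create-transfers call."""
        print("submitting", len(batch), "transfers in one request")

    # Phase 1: reserve funds with a pending transfer; if it is never posted,
    # it rolls back automatically once the timeout expires.
    pending = {
        "id": 1001,
        "debit_account_id": 1,          # payer
        "credit_account_id": 2,         # payee
        "amount": 500,
        "timeout": 30,                  # seconds until automatic rollback
        "flags": ["pending"],
    }
    create_transfers([pending])

    # Phase 2: post (commit) the reserved amount, atomically linked to a fee,
    # so that either both transfers happen or neither does.
    post = {
        "id": 1002,
        "pending_id": 1001,
        "amount": 500,
        "flags": ["post_pending_transfer", "linked"],
    }
    fee = {
        "id": 1003,
        "debit_account_id": 2,
        "credit_account_id": 3,         # fee account
        "amount": 5,
        "flags": [],                    # last transfer closes the linked chain
    }
    create_transfers([post, fee])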

On the other hand, moving the data to the code, to make decisions outside the OLTP database, was exactly the anti-pattern we wanted to fix in the central bank switch, as it tried to implement debit/credit primitives over a general-purpose DBMS. It's really hard to get these things right on top of Postgres.

And even if you get the primitives right, the performance is fundamentally limited by row locks interacting with RTTs and contention. Again, these row locks are not only external but also internal (i.e. how I/O interacts with CPU inside the DBMS), which is why stored procedures or extensions aren't enough to fix the performance.


Can you expand on why a sproc isn't a good solution (e.g. send a set of requests, process those that are still in a valid state, error those that aren't, return responses)?

Maybe knowing the volumes you are dealing with would also help.



TimescaleDB's $110M series C would like to have a word :)


Hey mihaic! Thanks for the question.

> I honestly don't understand why this isn't a Postgres extension.

We considered a Postgres extension at the time (as well as stored procedures or even an embedded in-process DBMS).

However, this wouldn’t have moved the needle to where we needed it to be. Our internal design requirement (TB started as an internal project at Coil, contracting on a central bank switch) was literally a three-order-of-magnitude increase in performance, to keep up with where transaction workloads were going.

While an extension or stored procedures would reduce external locking, the general-purpose DBMS design implementing them still tends to do far too much internal locking, interleaving disk I/O with CPU and coupling resources. In contrast, TigerBeetle explicitly decouples disk I/O and CPU to amortize internal locking and so “pipeline in bulk” for mechanical sympathy. Think SIMD vectorization but applied to state machine execution.

For example, before TB’s state machine executes 1 request of 8k transactions, all data dependencies are prefetched in advance (typically from L1/2/3 cache) so that the CPU becomes like a sprinter running the 100 meters. This suits extreme OLTP workloads where a few million debit/credit transactions need to be pushed through less than 10 accounts/rows (e.g. for a small central bank switch with 10 banks around the table). This is pathological for a general-purpose DBMS design, but easy for TB because hot accounts are hot in cache, and all locking (whether external or internal) is amortized across 8k transactions.
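
As a hedged illustration of that prefetch-then-execute pattern (a toy in Python, nothing like the real Zig state machine, and the storage object here is a hypothetical interface): gather every account the batch touches, load them once, run the whole batch in memory with no I/O in the hot loop, then write back once.

    def process_batch(storage, transfers):
        """Toy sketch: amortize I/O and locking across one batch of transfers."""
        # Phase 1: prefetch all data dependencies for the batch up front.
        account_ids = set()
        for t in transfers:
            account_ids.add(t["debit_account_id"])
            account_ids.add(t["credit_account_id"])
        accounts = {aid: storage.load_account(aid) for aid in account_ids}

        # Phase 2: execute the whole batch purely in memory; hot accounts stay
        # hot in cache, and no disk or network I/O interrupts the CPU.
        for t in transfers:
            accounts[t["debit_account_id"]]["debits_posted"] += t["amount"]
            accounts[t["credit_account_id"]]["credits_posted"] += t["amount"]

        # Phase 3: one write-back for the whole batch.
        storage.store_accounts(accounts.values())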

I spoke at QCon SF on this (https://www.youtube.com/watch?v=32LMicc0gRA) and matklad did two IronBeetle episodes walking through the code (https://www.youtube.com/watch?v=v5ThOoK3OFw&list=PL9eL-xg48O...).

But the big problem with extensions or stored procedures is that they still tend to have a “one transaction at a time” mindset at the network layer. In other words, they don’t typically amortize network requests beyond a 1:1 ratio of logical transaction to physical SQL transaction; they’re not ergonomic if you want to pack a few thousand logical transactions in one physical query.

On the other hand, TB’s design is like “stored procedures meets group commit on steroids”, packing up to 8k logical transactions in 1 physical query, and amortizing the costs not only of state machine execution (as described above) but also syscalls, networking and fsync (it’s something roughly like 4 syscalls, 4 memcopies and 4 network messages to execute 8k transactions—really hard for Postgres to match that).
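
The client-side consequence, sketched loosely in Python (the flush callback stands in for whatever create-transfers call a real binding exposes, and the batch size is illustrative): accumulate logical transfers and send them a few thousand at a time, so the per-request costs above are paid once per batch rather than once per transfer.

    MAX_BATCH = 8000  # illustrative; "up to 8k" per the description above

    class Batcher:
        """Toy batcher: amortize one physical request over many logical transfers."""

        def __init__(self, flush):
            self.flush = flush      # stand-in for a real create-transfers call
            self.pending = []

        def add(self, transfer):
            self.pending.append(transfer)
            if len(self.pending) >= MAX_BATCH:
                self.flush(self.pending)    # one request, one fsync, ~8k transfers
                self.pending = []

        def close(self):
            if self.pending:
                self.flush(self.pending)
                self.pending = []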

Postgres is also nearly 30 years old. It's an awesome database, but hardware, software and research into how you would design a transaction-processing database today have advanced significantly since then. For example, we wanted more safety around things like Fsyncgate by having an explicit storage fault model. We also wanted deterministic simulation testing and static memory allocation, and to follow NASA's Power of Ten Rules for Safety-Critical Code.

A Postgres extension would have been a showstopper for these things, but these were the technical contributions that needed to be made.

I also think that some of the most interesting performance innovations (static memory allocation, zero-deserialization, zero-context switches, zero-syscalls etc.) are coming out of HFT these days. For example, Martin Thompson’s Evolution of Financial Exchange Architectures: https://www.youtube.com/watch?v=qDhTjE0XmkE

HFT is a great precursor to see where OLTP is going, because the major contention problem of OLTP is mostly solved by HFT architectures, and because the arbitrage and volume of HFT is now moving into other sectors—as the world becomes more transactional.

> In what case is it better to have two databases?

Finally, regarding two databases: this was something we wanted to be explicit in the architecture. Not to "mix cash and customer records" in one general-purpose mutable filing cabinet, but rather to have "separation of concerns": the variable-length customer records in the general-purpose DBMS (the filing cabinet) in the control plane, and the cash in the immutable financial transactions database (the bank vault) in the data plane.

See also: https://docs.tigerbeetle.com/coding/system-architecture

It's the same reason you would want Postgres + S3, or Postgres + Redpanda. Postgres is perfect as a general-purpose or OLGP database, but it's not specialized for OLAP like DuckDB, or specialized for OLTP like TigerBeetle.
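
A loose sketch of that separation of concerns, with all names made up for illustration (pg_execute and ledger_create_transfers are stand-ins, not real APIs): the general-purpose DBMS holds the variable-length customer record plus only a reference to the ledger account, while balances and transfers live solely in the ledger.

    def pg_execute(sql, params):
        """Stand-in for a query against the general-purpose DBMS (control plane)."""
        print("control plane:", sql, params)

    def ledger_create_transfers(batch):
        """Stand-in for a request to the financial transactions database (data plane)."""
        print("data plane:", batch)

    # Control plane: mutable, variable-length customer record, plus the id of
    # the customer's account in the ledger.
    pg_execute(
        "INSERT INTO customers (id, name, email, ledger_account_id) VALUES (%s, %s, %s, %s)",
        (42, "Ada Lovelace", "ada@example.com", "1000000000000000042"),
    )

    # Data plane: immutable debit/credit transfers, referencing accounts by id only.
    ledger_create_transfers([{
        "id": 9001,
        "debit_account_id": 1000000000000000042,
        "credit_account_id": 1,   # e.g. a settlement account
        "amount": 2500,
    }])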

Again, appreciate the question and happy to answer more!


Thanks for taking the time for the explanation and the rundown on the architecture. It sounds a bit like an LMAX Disruptor for a DB, which honestly is quite a natural design for performance. Kudos for the Zig implementation as well; I've never seen as serious a project written in it.

Personally, I still see challenges in developing on top of a system with data in two places unless there's a nice way to sync between them, and I would have framed the mutable/immutable classification more as unlogged vs. fully logged changes in the DB, but I'm just doing armchair analysis here.


Huge pleasure! :)

Exactly, the Martin Thompson talk I linked above is about the LMAX architecture. He gave that talk at QCon London, I think in May 2020, and we were designing TigerBeetle in July 2020, pretty much lapping it up (I'd already been a fan of Thompson's Mechanical Sympathy blog for a few years by that point).

I think the way to see this is not as "two places for the same type of data" but rather as "separation of concerns for radically different types of data" with different compliance/retention/mutability/access/performance/scale characteristics.

It's also a natural architecture, and nothing new: it's how you would probably want to architect the "core" of a core banking system. We literally lifted the design for TigerBeetle directly out of the central bank switch's internal core, so that it would be dead simple to "heart transplant" it back in later.

The surprising thing though, was when small fintech startups, energy and gaming companies started reaching out. The primitives are easy to build with and unlock significantly more scale. Again, like using object storage in addition to Postgres is probably a good idea.


You mention in your post that you apply model checking on the actual code. Have you posted something where you go into more detail on that technique?


TB's simulator is called the VOPR, standing for “Viewstamped Operation Replicator” (and a tribute to War Games' WOPR).

You can read the code here: https://github.com/tigerbeetle/tigerbeetle/blob/f8a614644dcf...

This does things like:

- Abstract time (all timeouts etc.) in the DBMS, so that time can be accelerated (roughly by 700x) by ticking time in a while true loop; a toy sketch of this idea follows the list.

- Abstract storage/network/process and do fault injection across all the storage/network/process fault models. You can read about these fault models here: https://docs.tigerbeetle.com/about/safety.

- Verify linearizability, but immediately as state machines advance state (not after the fact by checking for valid histories, which is more expensive), by comparing each state transition against the set of inflight client requests (the simulator controls the world so it can do this).

- Not only check correctness, but also test liveness: that durability is not wasted, and that availability is maximized given the durability at hand. In other words, given the amount of storage/network faults (or f) injected into the cluster, and according to the specification of the protocols (the simulator is protocol-aware), is the cluster as available as it should be? Or has it lost availability prematurely? See: https://tigerbeetle.com/blog/2023-07-06-simulation-testing-f...

- Then also do a myriad of things like verify that replicas are cache-coherent at all times with their simulated disk, that the page cache does not get out of sync (like what happened with Linux's page cache in Fsyncgate) etc.

- And while this is running, there are 6000+ assertions in all critical functions checking all pre/post-conditions at function (or block) scope.
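
To make the time-abstraction and determinism point concrete, here is a toy sketch in Python (nothing like the real VOPR linked above): the system under test never reads a wall clock, it gets ticked by the simulator, so a single seeded PRNG can drive both accelerated time and reproducible fault injection, with invariants asserted on every transition.

    import random

    class SimulatedClock:
        """The system under test asks this clock for the time; the simulator ticks it."""
        def __init__(self):
            self.ticks = 0
        def tick(self):
            self.ticks += 1

    class FlakyNetwork:
        """Drops messages deterministically, driven by the seeded PRNG."""
        def __init__(self, prng, drop_probability):
            self.prng = prng
            self.drop_probability = drop_probability
            self.delivered = []
        def send(self, message):
            if self.prng.random() >= self.drop_probability:
                self.delivered.append(message)

    def simulate(seed, ticks=1_000_000):
        prng = random.Random(seed)      # same seed => same faults => replayable run
        clock = SimulatedClock()
        net = FlakyNetwork(prng, drop_probability=0.1)
        for _ in range(ticks):          # accelerated time: no sleeping, just ticking
            clock.tick()
            net.send(("heartbeat", clock.ticks))
            # Stand-in for the thousands of assertions checking invariants
            # after every state transition.
            assert len(net.delivered) <= clock.ticks

    simulate(seed=123)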

See also matklad's “A Deterministic Walk Down TigerBeetle's Main Street”: https://www.youtube.com/watch?v=AGxAnkrhDGY

And please come and join us live every Thursday at 10am PT / 1pm ET / 5pm UTC for matklad's IronBeetle on Twitch, where we do code walkthroughs and live Q&A: https://www.twitch.tv/tigerbeetle



