We had a sales call with them last year and got to speak to one of their devs. The impression they gave was that it was mostly ex-Facebook data guys that left to start a company based on the work they did on Cassandra and a few other internal projects.
The really interesting feature, to us, was the promise that once the Postgres-compatible layer was complete, we could use whatever semantics were appropriate for our business use case while using the same logical database cluster. We could use the Redis interface for persistent caching, the CQL interface for our NoSQL-appropriate use cases and the Postgres interface for our more traditional use cases. And the client libraries for all those interfaces are the same ones we already use to talk to Redis and Postgres (our conversation happened because we were starting a project that was more NoSQL-appropriate, so we weren’t using Cassandra yet), so very little of our code would have to change.
They're a good team. The same data isn't available across different interfaces but there is definitely value in having a single core system that serves multiple apps and models.
The question is, how many of these very similar databases can the market support?
The field is getting crowded, and the database market was already plenty competitive before these new entrants showed up.
There just are not that many use cases where a larger Postgres/MySQL instance with one or two replicas is insufficient.
From a user perspective, I'd much rather have one or two successful companies where I can be reasonably certain that the product will be maintained in 5 years than too much competition.
I do think there are going to be more and more use cases where having more than one or two replicas is necessary, even if there aren't that many right now. IoT strikes me as an example. But even so, I'm with you. I'd much rather have two companies that I know will maintain the software long term than a plethora of competitors that might die out at any time. And while being open source is helpful in that regard, there's no guarantee that the open source community will maintain it after the backing company is gone. Databases are too important to gamble on imo.
On the other hand, a little competition to make sure the big guys stay on their toes is a good thing.
Even if the commercial/AGPL/GPL ones "win" for a few years, they won't be able to compete with these more Open-licensed DBs once those catch up.
So at any given moment too many DBs may be annoying, but for the long game it is important that this kind of competition & research keeps going.
Although I absolutely agree that when it comes to Master-Slave based systems (I've been very vocal in criticizing them), that market is drying up to some very limited use cases (banking, etc.). 99%+ of use cases will be Strong Eventual Consistency and CRDTs with distributed or decentralized/P2P tools.
Some really old, yet very relevant, thoughts on this subject:
I'm thoroughly confused by your association of (a)GPL with commercial, and by calling Apache/BSD licenses more open. AGPL ensures that users always retain the four freedoms, by restricting developers. BSD allows developers to do whatever they want, including restricting the users. Neither is "more open"; they both make trade-offs, and neither is comparable to proprietary, except to say that BSD-style licenses allow for it if the developer chooses.
Look at the kerfuffle around Mongo, Redis, and Elasticsearch because of their licenses. However, you don't hear the same issues coming from the Postgres community. The licenses you're claiming will win the day cause problems for for-profit companies, for exactly the reason you think they're "more open".
In the end, either entrenched proprietary software or open, community-focused, community-stewarded software will win the day.
I believe we both have reasonable arguments from our paradigm, it is just the paradigms have conflicting definitions.
When people who share camp with me say "Open" or "Freedom" we mean Free Speech AND Free Beer.
Where the disagreement happens is on Free Speech:
There are many people/governments that define Free Speech as "Free Speech as long as someone does not shout 'fire' in a crowded room." This is the spirit of (a)GPL in restricting people.
The other group defines Free Speech and/or "Freedom" as "without restriction". Not because they want people to yell "fire", but because they see restriction/regulation as the mechanism towards monopoly & centralization. Not that regulation/restriction on its own is bad (every individual ought to exercise self-discipline), but it is particularly dangerous once monopoly & centralization emerge, because it produces totalitarian or fascist structures.
To counter my own view, many people in the camp opposite of me, have expressed same end-goal concerns "we want to restrict hate speech so fascism doesn't rise". I think it is admirable we have shared-goals (stopping totalitarianism), but for reasons you probably don't share, I think it is more effective to stop fascism by removing the ability for fascists to enforce rules/regulation/restrictions on individuals, even if that comes at the cost or risk of someone yelling "fire".
Why? (I don't assume anyone cares about my view, so don't feel obligated to read) Because I have higher optimism that humans will eventually overcome their individual immaturity (shouting "pen--" in a crowd), especially through incentive design, than in humans overcoming their tendency towards abuse of power (or even worse, most people who "abuse" power don't think they are abusing it, they have a conviction that the use of power is for some greater good). Wielding power is often the end game of any incentive structure, but yelling "fire" or "p--is" often ruins your reputation/power so naturally is disincentivized over time (or where it matters most).
I feel like your "fire" and "totalitarian" examples are confusing, entirely off-base and non-illustrative of anything useful to this conversation.
Why? Because the difference between copyleft and non-copyleft licenses isn't akin to censorship vs. no censorship. The argument for copyleft is more akin to the argument for laws in general: someone's absolute freedoms need to be trodden on to have a free society.
I similarly fail to see how copyleft is a power to abuse. Surely the ability to close the source of an application is a power with more potential for abuse?
Your 2nd paragraph says pretty much what I was trying to say (except for difference in law views) that your 1st paragraph says is off-base.
Another way for me to say it is, that of course you would think my thoughts are off-base since I come from a different foundational base as you. I was just trying to explain the difference itself, not saying that you need to change views (your view is logical from your "base").
You think people's freedoms need to be trodden upon for a free society.
I don't. That scares me and many others.
Edit: I did not downvote you, just FYI, I don't know who/why would.
> You think people's freedoms need to be trodden upon for a free society.
Do you take this stance with laws against murder and theft? Society has laws and rules. People as a whole, as all available examples show, do not optimize for the greater good by default and without any rules or norms.
There are good talking points in the copyleft debate, but the claim that copyleft imposes rules and non-copyleft doesn't is false, and it doesn't move this debate forward in any meaningful way.
Since it doesn't support serializable transactions I'm not sure why FoundationDB would be mentioned as a comparison in the write up. The operations it does support seem to set the bar pretty low as to what to test.
edit: good reply by the founder of YugaByte, but for some reason the comment is dead. I have noticed that when founders don't have an account on here and something comes up where they need to reply, their comments often end up dead.
We have run a significant number of useful/practical tests via Jepsen that only need the snapshot isolation level. The tests included a single-key counter test, set tests with and without a secondary index, a "long fork" test ensuring that the order of operations is the same when observed by different clients, and a bank test verifying that the total balance across multiple accounts stays the same while cross-shard transactions transfer funds between pairs of accounts. These tests were run under a variety of failure modes, including different types of network partitions and clock skew. Snapshot isolation also covers a very large spectrum of practical use cases for building real-world applications, including secondary indexes.
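Roughly, the invariant the bank test checks looks like this toy sketch (not the actual Jepsen test code, which is written in Clojure; here an in-process lock stands in for the database's snapshot-isolated transactions, and the account count and amounts are made up):

    # Toy sketch of the bank-test invariant: concurrent transfers move money
    # between accounts, and every consistent read must see the same total.
    import random
    import threading

    N_ACCOUNTS, INITIAL = 5, 100
    accounts = {i: INITIAL for i in range(N_ACCOUNTS)}
    lock = threading.Lock()        # stands in for a snapshot-isolated transaction
    violations = []

    def transfer():
        for _ in range(1000):
            a, b = random.sample(range(N_ACCOUNTS), 2)
            amount = random.randint(1, 10)
            with lock:             # both writes commit atomically
                accounts[a] -= amount
                accounts[b] += amount

    def read_total():
        for _ in range(1000):
            with lock:             # a consistent snapshot of all rows
                total = sum(accounts.values())
            if total != N_ACCOUNTS * INITIAL:
                violations.append(total)

    workers = [threading.Thread(target=transfer) for _ in range(3)]
    workers += [threading.Thread(target=read_total) for _ in range(2)]
    for t in workers: t.start()
    for t in workers: t.join()
    print("invariant violations:", violations)   # expected: []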
Having said that, we have recently added support for the serializable isolation level to YugaByte DB, and we will be adding tests for it to the Jepsen suite in the near term.
We use CockroachDB in production, and before that we were on MySQL; as of yet we don't have a specific use case where we need serializable transactions. Snapshot isolation or even read committed is just fine. So I don't think it's absolutely necessary.
To be clear, there's no way around serializable transactions in CockroachDB. We have had to adapt our monolith to it (we're thinking of ways to make it more nimble by breaking out services, etc.). But the point I was making was that we had MySQL for a while and never ran into issues with its isolation levels until it stopped scaling. Instead of Vitess or some other MySQL system we went with Cockroach, after finding Vitess didn't fit us -- too complicated and too many moving parts. CockroachDB just works. Moving to k8s also adds complexity for a monolith built and run on VMs. But so far so good. Cockroach runs fast and performs well on our production queries. And ops is happy because it self-heals.
I wanted to add a few details to the previous reply.
While the Raft/HybridTime implementation has its roots in Apache Kudu, the results will NOT be directly applicable to Kudu. Aside from the fact that the code bases have evolved/diverged over 3+ years, there are key areas (ones very relevant to these Jepsen tests) where YugaByte DB has added capabilities or follows a different design than Kudu. For example:
-- Leader Leases: YugaByte DB doesn't use Raft consensus for reads. Instead, we have implemented "leader leases" to ensure that reads can safely be served from a tablet's Raft leader (rough sketch after this list).
-- Distributed/Multi-Shard Transactions: YugaByte DB uses a home-grown (https://docs.yugabyte.com/latest/architecture/transactions/t...) protocol based on two-phase commit across multiple Raft groups. Capabilities like secondary indexes and multi-row updates use multi-shard transactions.
-- Allowing online/dynamic Raft membership changes so that tablets can be moved (such as for load-balancing to new nodes).
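To sketch the leader-lease idea above (illustrative only, not the actual implementation; the lease duration is a made-up number): a leader serves reads locally only while it holds an unexpired lease, measured from the time the granting heartbeat was sent.

    # Illustrative leader-lease sketch: reads skip the Raft round trip only
    # while the leader's lease, acknowledged by a majority, is still valid.
    import time

    LEASE_DURATION = 2.0   # seconds; made-up number for illustration

    class TabletLeader:
        def __init__(self):
            self.lease_expiry = 0.0

        def on_majority_ack(self, heartbeat_sent_at):
            # Measure the lease from when the heartbeat was sent, not when the
            # acks arrived, so the bound stays conservative.
            self.lease_expiry = max(self.lease_expiry,
                                    heartbeat_sent_at + LEASE_DURATION)

        def read(self, key, local_store):
            if time.monotonic() >= self.lease_expiry:
                raise RuntimeError("lease expired: re-establish it via consensus")
            return local_store.get(key)

    leader = TabletLeader()
    sent = time.monotonic()
    # ...imagine a majority of peers acknowledging the heartbeat here...
    leader.on_majority_ack(sent)
    print(leader.read("k1", {"k1": "v1"}))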
FWIW, we implemented dynamic consensus membership change in Kudu way back in 2015 (https://github.com/apache/kudu/commit/535dae) but presumably that was after the fork. We still haven't implemented leader leases or distributed transactions in Kudu though due to prioritizing other features. It's very cool that you have implemented those consistency features.
Thanks for correcting me on the dynamic consensus membership change. Looks like the basic support was indeed there, but several important enhancements were needed (for correctness and usability).
- Remote bootstrap (due to membership changes) has also undergone substantial changes, given that YugaByte DB uses a customized/extended version of RocksDB as the storage engine and couples Raft more tightly with the RocksDB storage engine. (https://github.com/YugaByte/yugabyte-db/blob/master/docs/ext...)
- Dynamic Leader Balancing is also new -- it proactively shifts leadership in a running system to ensure each node is the leader for a similar number of tablets.
I'm curious if you did anything to prevent automatic rebalancing from being triggered at a "bad time" or have throttled it in some way, or whether moving large amounts of data between servers at arbitrary times was not a concern.
I am also curious if you added some type of API using the LEARNER role to support a CDC-type of listener interface using consensus.
We should really start some threads on the dev lists to periodically share this type of information and merge things back and forth to avoid duplicating work where possible. I know the systems are pretty different at the catalog and storage layers but there are still many similarities.
Yes, it does. At the core, the Raft implementation is still based on Kudu's. But these areas have been worked on actively, so the implementations might have diverged a little.
May be worth looking through the individual issues to see what applies and what doesn't:
Not a comment on YugaByte, but... I love it when a new Jepsen report gets released. Kyle Kingsbury has single-handedly raised the bar for an entire industry. (Well, not single-handedly anymore, but still.)
Couldn't agree more. There are 3 sources of information regarding database serializability/linearizability:
1. Marketing material (mostly useless)
2. Individual projects/post-mortems (50/50 here; some just mis-use the technology from the get-go, others have valid feedback, but it's tough to determine when either applies)
3. Jepsen Tests (which are more like independently verifiable science)
Sure, you can decide that your social-media solution has no need for consistency (or even durability!) - but in my experience, most solutions don't have that flexibility.
I think the YB team members are probably best equipped to talk about this, but I can note that while some databases do build their own clock synchronization protocol, many prefer to let the OS handle clocks. For one thing, clock sync is surprisingly tricky to do well, so it makes sense to write daemons that do it well once and be able to re-use them in lots of contexts. There's also the question of HW support: in theory, datacenter and hardware providers could do better than pure-software time synchronization by, say, offering dedicated physical links to a local atomic + GPS clock ensemble. AWS TimeSync is a step in this direction, and I wouldn't be surprised if we see more accurate clocks in the future.
There are still tons of caveats with this idea--Linux and most database software ain't realtime, for starters--but you can imagine a world in which clock errors are sufficiently bounded and infrequent that they no longer represent the most urgent threat to safety. That's ultimately a quantitative risk assessment.
My suspicion is that DB vendors like YugaByte and CockroachDB are making a strategic bet that although clocks right now are pretty terrible, they won't be that way forever. I'd like to see more rigorous measurement on this front, because while I've got plenty of anecdotes, I don't think we have a broad statistical picture of how bad typical clocks are, and whether they're improving.
As @aphyr mentioned, any NTP-like system would work. We can update the docs to mention PTP; we do work with AWS Time Sync as well (which uses Chrony).
In short, no: many transactional databases don't rely on clocks for safety. I'm going to speak in broad terms here--there's a lot of nuance and special cases that we can dig into, but I'd like to keep this accessible:
You can use CRDTs, and other commutative data structures, to obtain totally-available replicated objects across wide area networks. Systems like Riak do this. CRDTs can't express some types of computation safely, though! For instance, you can't do something like a minimum-balance constraint, ensuring that an account always contains $25 or more, if you allow both deposits and withdrawals, in a commutative system. Why? Because order matters! Deposit, withdraw is different than withdraw, deposit, in terms of their intermediate states.
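To make the order-sensitivity concrete, a toy example (plain Python, no CRDT library involved; the amounts are made up):

    # Deposits and withdrawals commute for the *final* balance, but a
    # minimum-balance rule depends on intermediate states, which differ
    # depending on the order the operations are applied.
    MIN_BALANCE = 25

    def apply_ops(start, ops):
        balance = start
        for op in ops:
            balance += op
            if balance < MIN_BALANCE:
                print(f"  constraint violated at intermediate balance {balance}")
        return balance

    start, deposit, withdraw = 30, +20, -20

    print("deposit then withdraw:")
    print("  final:", apply_ops(start, [deposit, withdraw]))   # never dips below 25
    print("withdraw then deposit:")
    print("  final:", apply_ops(start, [withdraw, deposit]))   # dips to 10 in between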
For order, you can use a consensus mechanism, like ZAB (Zookeeper), Paxos (Riak SC, Cassandra LWT), or Raft (etcd, consul) to replicate arbitrary state machines without any clock dependence at all. These systems require at least one round trip to establish consensus, and their guarantees only apply within the consensus system itself.
What if you have multiple consensus groups? Say, one per shard? Then you need a protocol to coordinate transactions on top of that. You can execute an atomic commit protocol for cross-shard transactions, perhaps using a consensus system. Or you can use a protocol like Calvin to obtain serializability (or stronger) across shards without relying on clocks. That's what FaunaDB does. That adds a round-trip, but if you're clever, you may only have to pay that round-trip cost between different datacenters once.
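Here's a bare-bones sketch of the atomic-commit idea (plain two-phase commit; in a real system each shard below would itself be a consensus group, and the durability/recovery details are elided):

    # Bare-bones two-phase commit across shards (illustrative only).
    class Shard:
        def __init__(self, name):
            self.name, self.data, self.staged = name, {}, {}

        def prepare(self, txn_id, writes):
            # Phase 1: durably stage the writes and vote yes/no.
            self.staged[txn_id] = writes
            return True

        def commit(self, txn_id):
            # Phase 2: apply the staged writes.
            self.data.update(self.staged.pop(txn_id))

        def abort(self, txn_id):
            self.staged.pop(txn_id, None)

    def two_phase_commit(txn_id, writes_by_shard):
        shards = list(writes_by_shard)
        if all(s.prepare(txn_id, w) for s, w in writes_by_shard.items()):
            for s in shards: s.commit(txn_id)
            return "committed"
        for s in shards: s.abort(txn_id)
        return "aborted"

    a, b = Shard("a"), Shard("b")
    print(two_phase_commit("t1", {a: {"x": 1}, b: {"y": 2}}))
    print(a.data, b.data)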
Another tactic is to exploit well-synchronized clocks to obtain consistent views across independent consensus groups. You can use this technique to (theoretically) reduce the number of round trips a transaction costs, and there are different ways to balance whether you pay increased latency on read or write transactions. Spanner, CockroachDB, and YugaByte DB all take this approach, with different tradeoffs.
Spanner is backed by custom hardware and carefully designed software, to obtain tight bounds on clock error. CockroachDB and YugaByte DB leave that problem to you, the operator.
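A rough sketch of the commit-wait trick that makes the clock-based approach safe (illustrative; the error bound here is a made-up constant, whereas Spanner derives a live bound from TrueTime):

    # "Commit wait": if every clock's error is bounded by EPSILON, then waiting
    # out the uncertainty window before acknowledging a commit guarantees the
    # commit timestamp is already in the past on every node that sees the ack.
    import time

    EPSILON = 0.007   # assumed worst-case clock error, in seconds (made up)

    def commit(apply_writes):
        commit_ts = time.time()      # timestamp chosen by the coordinator
        apply_writes(commit_ts)
        time.sleep(EPSILON)          # wait out the uncertainty window
        return commit_ts             # only now acknowledge to the client

    ts = commit(lambda t: print("applied writes at", t))
    print("acknowledged commit at", ts)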
Often, a database uses a stronger replication mechanism inside a datacenter, but when it comes to replicating between datacenters, backs off to a weaker strategy which doesn't offer the same safety invariants.
While FoundationDB uses Paxos for cluster state (like leader election), it is not on the commit path for a transaction. If any process fails in the transaction system (not storage processes), the cluster is reconfigured by the coordinators and every component is replaced. Transactions do not proceed during failures, but the cluster will replace the failed process in a few seconds and resume.
(This is not meant to be a contradiction, just pointing out an important difference compared to systems that allow progress in parallel with failures.)
Jepsen is not a performance test; we verify safety. I haven't looked at ScyllaDB personally, but you can read about Scylla's own work testing their database here [1], and see some of the issues they found here [2].
YugaByte product manager here. The YCQL API, which passes Jepsen, has its roots in the Cassandra Query Language but does not use Cassandra as its backend store. Its backend store is DocDB, a Google Spanner-inspired distributed document store.
Seems to be another distributed SQL (aka 'newsql') alternative to TiDB and CockroachDB.
Based on RocksDB (like Cockroach) with a custom distributed key/value layer and an additional SQL layer on top. PostgreSQL protocol compatible (quick connection sketch below).
Open source with an Apache license.
Seems interesting. (when ignoring the "planet scale SQL" marketing speak... [1])
[1] https://www.yugabyte.com/planet-scale-sql/
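Since the SQL layer speaks the Postgres wire protocol, a stock Postgres driver should be able to connect; a rough sketch (host, port, user, and database below are assumptions about a default local install, not guaranteed values):

    # Rough sketch: connect with an ordinary Postgres driver. The host, port,
    # user, and database are assumptions about a default local install.
    import psycopg2

    conn = psycopg2.connect(host="127.0.0.1", port=5433,  # assumed YSQL port
                            user="yugabyte", dbname="yugabyte")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT version()")
        print(cur.fetchone()[0])
    conn.close()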