rkarthik007's comments (Hacker News)

Spanner gets both correctness and low latency from tight clock synchronization. It uses COMMIT_WAIT, meaning it waits out the maximum clock skew before acknowledging a commit. But without TrueTime, the max clock skew is around 500ms, which is impractical to wait out. So 7ms is low latency, while 500ms is not just higher latency but impractical. And any other technique to drop latency (without HLC) will violate correctness.
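To make the tradeoff concrete, here is a minimal commit-wait sketch. This is illustrative only (hypothetical function names, not Spanner's or YugabyteDB's code): after picking a commit timestamp, the server blocks until the max clock skew has certainly elapsed, so no node anywhere can still consider an earlier time "now".

```python
import time

# Assumed skew bounds: ~7ms with TrueTime-class sync, ~500ms without.
MAX_CLOCK_SKEW_S = 0.007

def commit_wait(commit_ts: float, max_skew: float = MAX_CLOCK_SKEW_S) -> float:
    """Block until commit_ts + max_skew has passed; return the wait incurred."""
    deadline = commit_ts + max_skew
    waited = max(0.0, deadline - time.time())
    if waited > 0:
        time.sleep(waited)
    return waited

# With a 7ms bound the wait is negligible; swap in 0.5 and every
# single commit pays half a second of latency.
waited = commit_wait(time.time())
```

The point of the sketch: the wait is paid on *every* commit, which is why a 500ms bound makes commit-wait unusable without tighter synchronization.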

Disclosure: co-founder/CTO of YugabyteDB project


fwiw I think you're both saying something very similar. TrueTime has to be correct about max skew in order for it not to break the assumption Spanner is built upon. You could also use a looser time bound and still have correctness, but you'd end up with a database that is useless to most/all customers.


Interesting. How does this differ from HLC in practice - in the article you say you use a 500ms max skew for HLC?


The difference is in how the regular path (exercised most of the time) works vs. the edge case where there is a conflict (typically in larger clusters with a pathological access pattern). With TrueTime, the latency is always 7ms, with no issues in the pathological cases. With HLC, the latency is lower in the common case but high in the pathological cases (where it can be 500ms); those cases should not matter for many workloads.
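For readers unfamiliar with HLC: a hybrid logical clock combines physical time with a logical counter so that causally related events always get increasing timestamps, even across nodes with skewed clocks. A minimal sketch (illustrative names; not YugabyteDB's implementation):

```python
import time

class HLC:
    """Minimal hybrid logical clock: (wall, logical) pairs, ordered lexicographically."""

    def __init__(self):
        self.wall = 0     # highest physical time observed, in microseconds
        self.logical = 0  # tie-breaker when physical time hasn't advanced

    def now(self) -> tuple:
        """Timestamp for a local or send event."""
        pt = int(time.time() * 1_000_000)
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def update(self, msg_wall: int, msg_logical: int) -> tuple:
        """Merge a timestamp received from another node, preserving causality."""
        pt = int(time.time() * 1_000_000)
        if pt > self.wall and pt > msg_wall:
            self.wall, self.logical = pt, 0
        elif msg_wall > self.wall:
            self.wall, self.logical = msg_wall, msg_logical + 1
        elif msg_wall == self.wall:
            self.logical = max(self.logical, msg_logical) + 1
        else:
            self.logical += 1
        return (self.wall, self.logical)
```

The common case never waits (hence lower latency than commit-wait); the 500ms bound only matters in the rare conflict cases described above.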


What do you think of RIFL/TAPIR?


Thanks for pointing this out, was not aware of TAPIR. Will take a look, seems pretty interesting.


> but AFAIK can do that off-the-shelf today using chrony with hardware timestamping or PTP. No need to invent your own.

Actually, the issue is the max clock skew guarantee (as opposed to the average or median): even a single violation of it breaks the ACID semantics. So we do use chrony, but we need all of this on top to ensure there is a hard maximum. We would happily have adopted an existing solution; we did look at all the alternatives available.

Disclosure: one of the founders of the YugabyteDB project


How would you say your solution compares to the CockroachDB solution[1] ?

[1] https://www.cockroachlabs.com/docs/stable/architecture/trans...


This blog post covers the various techniques to sync clocks in general (NTP, GPS clocks / TrueTime, Timestamp Oracle, HLC, etc). CockroachDB uses HLC (hybrid logical clocks), which is the same clock sync mechanism as YugabyteDB uses (and Apache Kudu as well).

The page you linked to talks about how HLC is used to get distributed transactions to work, and there are differences here between the different databases (Spanner, CRDB, YugabyteDB, etc).


When building YugabyteDB, we reused the "upper half" of PostgreSQL, just like Amazon Aurora PostgreSQL, and hence support most of the functionality in PG (including advanced features like triggers, stored procedures, the full suite of indexes (partial/expression/function indexes), extensions, etc).

We think a lot about this exact question. Here are some of the things YugabyteDB can do as a "modern database" that a PostgreSQL/MySQL cannot (or will struggle to):

* High availability with resilience / zero data loss on failures and upgrades. This falls out of the inherent architecture: with traditional leader-follower replication you can lose data, and with solutions like Patroni you can lose availability / optimal utilization of cluster resources.

* Scaling the database. This includes scaling transactions, connections and data sets *without* complicating the app (like having to read from replicas sometimes and from the primary at other times, depending on the query). Scaling connections is also important for lambda/function-style apps in the cloud, as they could all try to connect to the DB in a short burst.

* Replicating data across regions. Use cases like geo-partitioning and multi-region sync replication to tolerate a region failure without compromising ACID properties. Some folks think this is far-fetched; it's not. Examples: the recent fire in an OVH datacenter and the Texas snowstorm both caused regional outages.

* Built-in async replication. Typically, async replication of data is "external" to DBs like PG and MySQL. In YugabyteDB, since replication is a first-class feature, it is supported out of the box.

* Follower reads / reads from the nearest region with programmatic staleness bounds: read stale data for a particular query from the local region, as long as the data is no more than x seconds old.

* We recently enhanced the JDBC driver to be cluster aware, eliminating the need to maintain an external load balancer because each node of the cluster is "aware" of the other nodes at all times - including node failures / add / remove / etc.

* Finally, we give users control over how data is distributed across nodes - for example, do you want to preserve ASC/DESC ordering of the PKs or use a HASH based distribution of data.

There are a few others, but this should give an idea.
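To illustrate the last point above, here is a toy sketch of the two distribution choices (hypothetical helper names and split points, not YugabyteDB's sharding code):

```python
import hashlib
from bisect import bisect_right

NUM_TABLETS = 16

def hash_shard(pk: str) -> int:
    """HASH distribution: spreads keys evenly across tablets,
    avoiding hotspots on monotonically increasing keys."""
    h = int.from_bytes(hashlib.sha1(pk.encode()).digest()[:2], "big")
    return h % NUM_TABLETS

# Illustrative tablet boundaries for ASC range sharding.
RANGE_SPLITS = ["g", "n", "t"]

def range_shard(pk: str) -> int:
    """Range distribution: preserves key ordering, so adjacent PKs
    land on the same tablet (good for range scans / ORDER BY)."""
    return bisect_right(RANGE_SPLITS, pk)
```

The tradeoff: hash distribution load-balances writes, while range distribution keeps ordered scans local to one tablet; which one you want depends on the access pattern.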

(Disclosure: I am the cto/co-founder of Yugabyte)


Hi @catblast,

(I am the CTO of Yugabyte) Your points are all completely valid. Just wanted to add my 2 cents.

With YugabyteDB specifically, we are more than just PostgreSQL wire-compatible: we "reuse" the upper half of PostgreSQL to support almost all PG features (examples: stored procedures, triggers, partial indexes, etc). So the aim is to build something that has "almost all PG features" while being able to "run cloud-native, with HA, scale and geo-distribution". Our hope is that this allows YugabyteDB to become a truly viable option instead of PostgreSQL when apps are being built for the cloud. Here is a blog post on the benefits we realized by reusing PostgreSQL: https://blog.yugabyte.com/why-we-built-yugabytedb-by-reusing...


Hi @hkolk,

Thanks for your feedback! Not sure when you tried YugabyteDB, but our serializable isolation level and YSQL API (which is needed to exercise serializability) were in beta until a couple of days ago. That said, if you can share some feedback, that would help us out immensely. All kinds of feedback welcome, be it about the product or why you feel we are not transparent. Absolute transparency has always been our goal, and your feedback will definitely help us improve.

(cto/co-founder)


Hi @Nican, sure thing.

There are two benchmark workloads, simple inserts and secondary index. The simple inserts workload inserts 50M unique key-values into the database using prepare-bind INSERT statements, with 256 writer threads running in parallel. There were no reads against the database during this period. The secondary index workload does the same thing against a table that has an index on the value column (forcing each operation to transactionally update both the primary table and the index table). We use this simple workload frequently to test the scalability of YugabyteDB. If interested, here is the sample app repo: https://github.com/YugaByte/yb-sample-apps

Here is a previous version of the benchmark comparing YCQL (no YSQL; it was not GA then), with more details about the scenario we are going for (that part applies to this benchmark too): https://blog.yugabyte.com/yugabyte-db-vs-cockroachdb-perform... That post covers more workloads than this one; we have not had time to run everything on YSQL yet. The aim of this post was mainly to explore write performance and scalability vs Amazon Aurora.
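The shape of the "simple inserts" workload can be sketched in a few lines. This is a rough stand-in (sqlite3 instead of a real distributed DB driver, 4 threads instead of 256); the actual harness is in the yb-sample-apps repo linked above.

```python
import sqlite3
import threading

NUM_THREADS = 4        # the benchmark used 256 writer threads
KEYS_PER_THREAD = 100  # the benchmark inserted 50M rows total

db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
lock = threading.Lock()  # sqlite needs serialized access; a real DB would not

def writer(tid: int) -> None:
    # Prepare-bind INSERTs of unique key-values; no reads during the run.
    for i in range(KEYS_PER_THREAD):
        key = f"key-{tid}-{i}"  # unique per thread, so no write conflicts
        with lock:
            db.execute("INSERT INTO kv (k, v) VALUES (?, ?)", (key, f"val-{i}"))

threads = [threading.Thread(target=writer, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The secondary-index variant is the same loop against a table with an index on `v`, so each insert becomes a transactional update of two tables.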

> As another note, I believe Aurora added multi-master support recently. How does that compare as well?

Good question. A multi-master deployment sacrifices consistency with last-writer-wins semantics, so we did not benchmark against it. For example, the benchmark driver could write to one node and read from another before the data was replicated. But it may still be interesting from a tradeoff perspective (perf vs consistency).

Also, note that we have just announced multi-master support between separate YugabyteDB clusters, so a benchmark is probably something we should do at some point anyway!
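For concreteness, the last-writer-wins semantics mentioned above can be sketched in a couple of lines (illustrative names; not any particular product's conflict-resolution code):

```python
def lww_merge(a: tuple, b: tuple) -> tuple:
    """Each replica's value carries a (timestamp, payload) pair;
    the later timestamp wins, silently discarding the other write."""
    return a if a[0] >= b[0] else b

# Two masters accept conflicting writes for the same key:
master1 = (1001, "alice's update")
master2 = (1000, "bob's update")  # accepted first, but with an earlier clock

merged = lww_merge(master1, master2)  # bob's committed write is lost
```

That silent loss of an acknowledged write is exactly the consistency sacrifice that makes an apples-to-apples benchmark against a strongly consistent system misleading.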


Yup, already done; we dropped it from the title of the post. Our aim was never to misrepresent. It's a difference of opinion on what "Jepsen testing" represents to us: core transactional correctness or transactional DDL (which most distributed DBs do not support). But the nuance is not worth the hassle, and we're working on it anyway, so we'll just update this back in a few weeks.


We are working on R2DBC; this is currently an active project and we have made good progress so far. If you have a use case or are interested in guiding the project, please join our community Slack!


Very astute observation! Two things:

* We implemented a feature to share memory between these connection handling processes so that makes it a bit more efficient

* Longer term we are thinking of switching to a thread based model. This is how our other API YCQL works.

(cto / co-founder)


I think this is a very important thing to prioritize for a DB of this nature. Postgres/RDBMS connections in general are an extreme weak spot for containerized/serverless applications.

DynamoDB/Elasticsearch's approach of not needing to establish/maintain a connection seems like the ideal solution. I think for applications that would utilize your DB, a minor hit to the speed of an individual request is well worth it compared to being able to scale up the number of simultaneous requests.


Yes absolutely spot on!

Peeling the onion one more layer, there are two underlying features that are required:

1) Move to a threaded model to be able to scale instantaneously. This is already the case with YCQL, we're planning on doing this for YSQL.

2) The client drivers need to become smarter about how the app connects to the DB. In fact, we're working on enabling this for the Spring (Java) ecosystem by enhancing the JDBC driver (still in the early stages): https://github.com/YugaByte/ybjdbc

(cto/co-founder)


What does the JDBC driver have to do with thread-per-connection in the database? JDBC doesn't care whether you use threads or processes for the connection.


True for a single node. However, there will be multiple IP addresses/connections anyway, since YugabyteDB is a distributed DB and runs across multiple nodes. A JDBC driver connects to only one node, so to provide true connection scaling, the client side needs to become aware of the multi-node cluster and pool connections across it. There are also a few other, slightly related issues to solve:

* The node given to the JDBC driver could be down, or nodes could get added/removed over time. This makes "discovery of nodes" a problem.

* We could use a random round-robin strategy, where the client connects to a random cluster node which internally routes to the appropriate node. This would increase latency, as well as the net number of connections needed.

These may not matter if the load balancer is smart, as in the case of Kubernetes (where the extra hop is mandatory anyway). But for non-k8s deployments, they help.
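The cluster-aware driver idea above can be sketched roughly like this (hypothetical class and method names, not the actual ybjdbc API): the client keeps a live node list, refreshes it as nodes come and go, and round-robins new connections across it.

```python
import itertools

class ClusterAwareDriver:
    """Toy sketch: topology-aware connection placement on the client side."""

    def __init__(self, seed_nodes):
        self.nodes = list(seed_nodes)
        self._rr = itertools.cycle(self.nodes)

    def refresh(self, live_nodes):
        """Called periodically with the current cluster membership;
        this is the 'discovery of nodes' piece (add/remove/failure)."""
        self.nodes = list(live_nodes)
        self._rr = itertools.cycle(self.nodes)

    def next_node(self):
        """Pick the node for the next new connection, round-robin."""
        return next(self._rr)

driver = ClusterAwareDriver(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
picked = [driver.next_node() for _ in range(6)]  # spreads across all 3 nodes
driver.refresh(["10.0.0.2", "10.0.0.4"])  # nodes 1 and 3 left, node 4 joined
```

This avoids both pinning all connections to one IP and the extra hop of routing every request through a random node.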


Would async/event based work? Of course with at least one thread per core.


We are working with the R2DBC folks on reactive programming for async/event-based use cases as well. This work is just getting kicked off. If interested, please join our community Slack and give us any feedback/thoughts.


Would much appreciate some constructive criticism! - CTO/co-founder

