Citus' Replication Model: Today and Tomorrow (citusdata.com)
47 points by aamederen on Dec 17, 2016 | 9 comments



Somewhat tangential, but I'm kind of sad that not much more was done with pg_paxos[1].

Group Replication[2] just made it into MySQL 5.7.17 and is GA; it uses a variant of Paxos to commit transactions. Even though most people will probably do fine with streaming replication or semisync, it would be nice to have support in Postgres for use cases that need an even stronger level of consistency.

1: https://github.com/citusdata/pg_paxos

2: http://mysqlhighavailability.com/mysql-group-replication-a-q...
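To make the consensus idea concrete: both pg_paxos and Group Replication build on majority agreement, for which single-decree Paxos is the textbook core. This is a minimal in-memory sketch of that protocol only, not how either system is actually implemented (no networking, failures, or multi-decree log):

```python
# Illustrative single-decree Paxos: a value is "chosen" once a majority
# of acceptors accept it, and later proposers must rediscover it.

class Acceptor:
    def __init__(self):
        self.promised = 0        # highest ballot promised so far
        self.accepted = None     # (ballot, value) or None

    def prepare(self, ballot):
        # Phase 1b: promise to ignore lower ballots.
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        # Phase 2b: accept unless a higher ballot was promised.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def propose(acceptors, ballot, value):
    """Try to get `value` chosen; return the value actually chosen."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: gather promises from a majority.
    replies = [a.prepare(ballot) for a in acceptors]
    granted = [acc for ok, acc in replies if ok]
    if len(granted) < majority:
        return None
    # If any acceptor already accepted a value, we must re-propose it.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]    # value of the highest accepted ballot

    # Phase 2: ask acceptors to accept.
    votes = sum(a.accept(ballot, value) for a in acceptors)
    return value if votes >= majority else None


acceptors = [Acceptor() for _ in range(3)]
chosen = propose(acceptors, ballot=1, value="commit txn 42")
# A later proposer with a higher ballot learns the already-chosen value:
rechosen = propose(acceptors, ballot=2, value="something else")
```

The second proposal is the interesting part: even though it asks for a different value, the protocol forces it to converge on what was already chosen, which is the consistency property transaction commit needs.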


What would be the advantage of Citus over AWS Aurora or AWS RDS?


Craig from Citus here. RDS is great if your data fits and performs well on a single node (if your data stays under ~100 GB there is really no reason to think about sharding and distributing it). Whether you need to scale out depends heavily on your workload: some applications are fine just scaling storage, others need to scale memory and compute. Citus in that sense is a bit different from both RDS and Aurora, as we extend Postgres to distribute your data across multiple nodes. Thus, when you add a node to the cluster you get an increase in storage, memory, and cores doing work for you.
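The "distribute your data across multiple nodes" part works by routing each row to a shard based on a hash of a distribution column, with shards spread over worker nodes. This toy sketch only illustrates that idea; the hash function, shard count, and worker names here are made up, not Citus's actual internals:

```python
# Toy hash-based sharding: rows are routed to shards by hashing a
# distribution column, and shards are placed round-robin on workers.
# Adding a worker means more shards' storage/memory/cores live there.
import hashlib

SHARD_COUNT = 8
workers = ["worker-1", "worker-2"]   # hypothetical node names

def shard_for(key: str) -> int:
    """Deterministically map a distribution-column value to a shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % SHARD_COUNT

def worker_for(shard: int) -> str:
    """Round-robin shard placement across worker nodes."""
    return workers[shard % len(workers)]

row_key = "customer-1138"
shard = shard_for(row_key)
node = worker_for(shard)
```

Because the routing is deterministic, queries filtered on the distribution column can be sent to exactly one shard's worker instead of fanning out to the whole cluster.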


RDS is performant well above 100 GB. Not to say Citus isn't a good product, but to imply sharding is required anywhere near the 100 GB threshold is a bit disingenuous.

Note that with Aurora you also get an increase in memory and cores when using multiple nodes...


I believe he was suggesting a lower bound: if you're over 100 GB, you should probably have a discussion around future plans and what you'll need to do to scale to 10 times that while keeping response times acceptable.


As other commenters have mentioned, indeed my goal wasn't to imply that you need to shard at 100 GB.

Yes, RDS still works great at 100 GB of data for 99% of applications. I generally never used to advise sharding until 1 TB, or really thinking about it until about 500 GB. But from dealing with customers who have sharded as early as 50 GB of data and as late as 3 TB, those that shard earlier always have a smoother time.

The hard question is whether you actually need to. If your data and indexes stay in cache then of course there is never a reason; it's a matter of whether you predictably know you'll grow and need to. Often with B2B products that are about to sign a large customer guaranteeing that growth, you can plan ahead of time, which makes life a bit easier. As a rule of thumb I wouldn't think about it before 100 GB these days, and once you hit that point I'd look at your growth pattern and at least have a plan if you expect to outgrow a single node.
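The "look at your growth pattern" step is simple compounding arithmetic. A back-of-envelope sketch, using the commenter's rough ~100 GB planning mark and an assumed 500 GB "seriously consider sharding" mark (the growth model and thresholds are assumptions, not Citus guidance):

```python
# Estimate how many months until a compounding dataset crosses a size
# threshold, so you can plan sharding before you hit it.
import math

def months_until(current_gb: float, target_gb: float, monthly_growth: float) -> float:
    """Months until current_gb * (1 + g)^n exceeds target_gb."""
    if current_gb >= target_gb:
        return 0.0
    return math.log(target_gb / current_gb) / math.log(1 + monthly_growth)

# Example: 40 GB today, growing 10% per month.
to_plan  = months_until(40, 100, 0.10)   # ~9.6 months to the planning mark
to_limit = months_until(40, 500, 0.10)   # ~26.5 months to the assumed limit
```

The useful output isn't the exact number; it's whether the gap between "start planning" and "must have sharded" is months or years for your workload.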


The only implication here is that above 100 GB sharding becomes something that's reasonable to consider.


I've been interested in PG sharding for a few months. I mostly hear about this Citus extension, but what about Postgres-XL? Has anyone had any experience comparing it against Citus?


XL has too many moving parts (coordinator + data nodes + GTM + GTM standbys) and no automatic high availability.





