Just to be clear, this would mean a PG extension that works with the new PG plug...

rkarthik007 · on Dec 3, 2018

Hi @manigandham, great question. As it stands now, YugaByte would not be a PG extension. It will be a fully distributed but standalone PG cluster, exposed through an API layer called "YSQL". We started from PG version 10.4 to build distributed YSQL.

PG is planning the pluggable storage API for version 12.0, but it will only address user tables. This is certainly essential, but not sufficient for true distributed SQL. Some of the other pieces are:

* Pluggable storage for system tables
* Ability to create the initial set of system tables in a pluggable manner (initdb equivalent)
* A number of other changes to the upper half of PG in order to make it "stateless" and scale-out, such as modifications to how certain operations are executed, which locks are held, what operations are pushed down to the underlying storage layer, etc.
* Support for different types of indexes
* Enhancing the optimizer to understand different storage types (in this case a distributed store)

In our current implementation, we have tried to make APIs for the above wherever possible. We need to explore how we can make runtime hooks for the other places. The plan is to see how to contribute these back to open-source PG over the longer term so that we can create a self-contained extension (though we are nowhere close to that today). Since our modified PG code base is open source, the hope is that it will make contributing these changes back to PG easier.

kmuthukk · on Dec 3, 2018

YugaByte DB's design is that a YB cluster supports Postgres in a native, self-contained & scale-out manner (much like YugaByte's Cassandra and Redis flavored offerings).

At a high level, the upper half of the Postgres DB is being largely reused. The lower half, i.e. the distributed table storage layer, uses YugaByte's underlying core: a transactional and distributed document-based storage engine.

For the DB to be scalable, the lower half being distributed is necessary but NOT sufficient. The upper half also needs to be extended so that it is aware of other nodes executing DDL/DML statements and can deal with the related concurrency while still allowing for linear scale. Making the optimizer aware of the distributed nature of table storage is the other major piece of work in the upper half.

These changes required in the upper half are what make the "100% pure extension" model a bit harder... but that's something we intend to explore jointly with the Postgres community.
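
For readers who want something concrete, here is a rough toy sketch of the upper-half/lower-half split described above. It is not YugaByte or Postgres code: `StorageNode`, `DistributedStore`, `QueryLayer`, and the modulo-based sharding are all hypothetical names and simplifications made up for illustration. It only shows the general idea of a query layer that holds no table data and pushes a filter down to a sharded store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, int]


@dataclass
class StorageNode:
    """Hypothetical storage node holding one shard of a table."""
    rows: List[Row] = field(default_factory=list)

    def scan(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # The filter is evaluated on the storage node (pushdown),
        # so only matching rows are returned to the query layer.
        return [r for r in self.rows if predicate(r)]


class DistributedStore:
    """Toy 'lower half': rows are partitioned across nodes by primary key."""

    def __init__(self, num_nodes: int) -> None:
        self.nodes = [StorageNode() for _ in range(num_nodes)]

    def insert(self, row: Row) -> None:
        # Assumption: simple modulo sharding on an integer "id" column.
        self.nodes[row["id"] % len(self.nodes)].rows.append(row)


class QueryLayer:
    """Toy 'upper half': keeps no table data, so many copies can run side by side."""

    def __init__(self, store: DistributedStore) -> None:
        self.store = store

    def select_where(self, predicate: Callable[[Row], bool]) -> List[Row]:
        results: List[Row] = []
        for node in self.store.nodes:
            results.extend(node.scan(predicate))
        return sorted(results, key=lambda r: r["id"])


store = DistributedStore(num_nodes=3)
for i in range(1, 7):
    store.insert({"id": i, "value": i * 10})

# Two independent query-layer instances over the same store.
q1, q2 = QueryLayer(store), QueryLayer(store)
matches = q1.select_where(lambda r: r["value"] > 30)
```

Because neither `QueryLayer` instance owns any rows, both return the same answer for the same query. The sketch leaves out the hard parts mentioned above, such as concurrent DDL/DML, locking, and a distribution-aware optimizer.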