We want to write a paper on the vast extensions ecosystem for Postgres at some point. It's on my TODO list for 2023.
Having so many plug-ins/add-ons available is unique to Postgres and due to Stonebraker's original vision of including extensibility as a first-class design principle in the DBMS.
In doing so, it would be great if you could provide details around extension safety. I've heard that some extensions may not play nicely in combination with others, and it would be really nice to know how one is to vet extensions for use in production.
these are two different things - wasmer runs WASM code inside postgres (as an extension), whereas our implementation runs an entire Postgres server as WASM (eg, you can run it in a browser: wasm.supabase.com)
We weren't the first to do this though - CrunchyData was. Theirs wasn't open source, so we worked with Snaplet to build an open source version and added networking: https://news.ycombinator.com/item?id=33067962
I want to ask a question of the engineers with experience in WebAssembly.
Last year I invited a student to compile ClickHouse to WebAssembly. But the task turned out to be harder than expected, supposedly due to poor 64-bit support in WASM.
Possibly lack of context. Without the edit, I would assume you're going for a version of "webscale" meme about postgres. (Haven't heard of planetscale before)
Ah interesting. Nope, definitely talking about a real database. I think cloud native is interesting, and two solutions I’m aware of in the space are CockroachDB and Planetscale.
The best performance is achieved with the combined method, named 'pread_threadpool'. It uses the `preadv2` syscall to check if the file is in the page cache.
And there is a pull request adding io_uring: https://github.com/ClickHouse/ClickHouse/pull/38456 - Unfortunately, it's unfinished and cannot be merged because the CI has found bugs. Nevertheless, the advantages of io_uring for analytical databases are negligible.
> The best performance is achieved with the combined method, named 'pread_threadpool'.
I found the same when testing my own database engine. `pread` in a suitably well-designed threadpool outperformed every other option for NVMe random-access 4k reads not in cache.
Variations in how many and which types of locks were used, how work was queued and ordered, and when threads were started, stopped, and scaled made a considerable difference as well. In certain system configs, `CLONE_IO` made a difference. I use tiny (smaller than 1 page) userspace stacks with `clone`-based threads, and dynamic auto-tuning of the number of blocked and executing threads.
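Roughly, spawning one such worker looks like the sketch below. The 16 KB stack, the exact flag set, and the trivial worker body are illustrative assumptions, not my actual configuration (my real stacks are much smaller and the workers pull requests off a queue):

```cpp
// Minimal sketch of a clone()-based worker with its own small stack and CLONE_IO.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              // for clone() and the CLONE_* flags
#endif
#include <sched.h>
#include <sys/wait.h>
#include <csignal>
#include <cstdio>
#include <cstdlib>

static int worker(void *arg) {
    // A real worker would pull read requests off a queue and issue pread() here.
    std::printf("worker running, arg=%p\n", arg);
    return 0;
}

int main() {
    const size_t kStackSize = 16 * 1024;                  // small, fixed userspace stack
    char *stack = static_cast<char *>(std::malloc(kStackSize));
    if (!stack) return 1;

    // CLONE_IO shares the I/O context with the parent; VM/FS/FILES make this behave
    // like a thread rather than a fork. SIGCHLD lets the parent waitpid() on it.
    const int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_IO | SIGCHLD;
    pid_t tid = clone(worker, stack + kStackSize, flags, nullptr);
    if (tid == -1) { std::perror("clone"); return 1; }

    waitpid(tid, nullptr, 0);                             // crude join; a real pool tracks its workers
    std::free(stack);
    return 0;
}
```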
That page-cache check is `preadv2` with the `RWF_NOWAIT` flag. It proved slower when I tried it in my db engine, which was a bit of a surprise. `RWF_NOWAIT` attempts a synchronous read from the page cache, before falling back to the thread pool to read asynchronously if the data is not cached.
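In sketch form, the probe-then-fallback looks something like this; `submit_to_threadpool` is a hypothetical stand-in for whatever asynchronous path the engine has:

```cpp
// Try a non-blocking page-cache read with preadv2(RWF_NOWAIT); only if the data
// isn't already cached, hand the read off to the thread pool.
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <cstddef>

// Stub: a real engine would enqueue the read for a pool thread to execute with a
// plain pread() and then signal completion.
void submit_to_threadpool(int fd, void *buf, size_t len, off_t off) {
    (void)fd; (void)buf; (void)len; (void)off;
}

// Returns true if the read was fully served from the page cache.
bool read_cached_or_enqueue(int fd, void *buf, size_t len, off_t off) {
    iovec iov{buf, len};
    ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n == static_cast<ssize_t>(len))
        return true;                     // synchronous cache hit
    // EAGAIN (not cached), a short read, or any error: fall back to the async path.
    // (Re-reading the whole range after a short read is a simplification.)
    submit_to_threadpool(fd, buf, len, off);
    return false;
}
```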
I expected an average speedup when there are many cache hits, so I was surprised and disappointed to find the `preadv2(..,RWF_NOWAIT)` syscall to be slow enough that it was usually a performance loss overall to use it, at least on the kernel versions and hardware I tested on (a reasonably fast software-RAID NVMe).
A nicer way to look at that is that the auto-tuning thread pool was sufficiently sleek and fast that the asynchronous read was fast enough to make the benefit of a synchronous path too small to be worth it.
One not in your list is mmap_threadpool. I found that for many workloads it was faster than pread_threadpool, and of course it does a better job of sharing memory with the kernel. Unlike synchronous mmap, it is effectively an asynchronous read, where a pool thread takes the page fault instead of a syscall, so the main thread is not blocked and the device I/O queues are kept full enough.
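A rough sketch of that idea, with error handling and the real pool omitted (`data.bin` is a hypothetical file):

```cpp
// "mmap + thread pool": the file is mapped read-only and an async task touches the
// pages, so the (possibly blocking) page faults happen off the main thread.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <future>

uint64_t fault_in(const volatile char *p, size_t len) {
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i += 4096)   // touch one byte per page
        sum += p[i];
    return sum;                               // keep the reads from being optimized away
}

int main() {
    int fd = open("data.bin", O_RDONLY);
    const size_t len = 1 << 20;
    void *map = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);

    // The "asynchronous read": a pool/async task takes the page faults for this range.
    auto fut = std::async(std::launch::async, fault_in,
                          static_cast<const volatile char *>(map), len);
    fut.get();                                // pages are now resident; use `map` directly

    munmap(map, len);
    close(fd);
}
```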
Compared with the `pread_threadpool` equivalent (in my db engine), I found `io_uring` was sometimes similar, sometimes slower, and never better, so it's not the preferred default. It makes sense that it could almost reach the devices' I/O capability, though with less control over queue depths than doing it directly in threads.
But I was surprised that the "zero-syscall" queues of `io_uring` didn't provide a noticeable improvement over `pread` syscalls, given that I measure a considerable baseline overhead for all syscalls like `pread`, `preadv2` and `futex`, and that overhead has a significant throughput effect in the `pread_threadpool` equivalent (because the NVMe devices were fast enough for syscall overhead to affect throughput).
> - Unfortunately, it's [io_uring] unfinished and cannot be merged because the CI has found bugs.
I found what I think is a subtle memory barrier bug in `liburing`. If the ClickHouse implementation is using `liburing` or copying its methods, it's conceivable that may be the cause of the hangs seen in CI. There are also kernel versions where `io_uring` was buggy, evidenced by changes in later kernels to fix bugs.
There are a number of ideas in the database space that the industry is adopting across the board:
- Separation of storage and compute (Neon, AlloyDB, Aurora). Every cloud database should build one. It's a big undertaking, but the benefits are undeniable.
- Good query processor for analytics (Snowflake, Velox, Singlestore)
- Open source. Especially in OLTP, open source == trust
- HTAP. Can run mixed workloads (Singlestore, Unistore): both OLTP (apps) and OLAP (reporting). This has always been a dream, but we still live in a world of dedicated systems, e.g. Snowflake and Postgres.
- Shared-nothing sharding (Vitess). This is the most controversial, as you lose compatibility with the mothership (MySQL for Vitess), so it's unclear whether this will be the dominant architecture in the future. I think the world may get to "dynamic sharding", where storage stays separate, compute can be multinode, and the user can easily and instantly change the number of nodes.
The separation of storage and compute is controversial, though it is definitely convenient for cloud operators. It is literally going backward in terms of the available storage bandwidth, and many modern database architectures can very effectively use all the storage bandwidth available in modern systems. There is not enough network bandwidth, and won't be for the foreseeable future, for running storage over networks to be efficient. Database workloads are also starting to move to edge environments, which makes it even worse. The resource mismatch currently leaves a lot of compute sitting idle in practice.
Another issue with this model is that it tends to offer poor ratios of bandwidth to storage size. This is at odds with the desire to run mixed workloads in a single system.
The separation of compute and storage is most of all convenient for whoever operates the database.
Need more compute, or want to scale it down when not in use? You can add it in minutes, with no re-balancing and no careful moving of data. Need more storage? You have it on tap. Want to give some group of users their own dedicated compute? Or run differently configured computes on top of the same data, even using different compute engines for batch and for interactive? You can.
Compare Snowflake's "create warehouse", which gives you compute that can immediately access any data, with the dance you need to do to resize ClickHouse and decide which tables are replicated in what way.
At the same time, everything you say is true, and that is why a system without storage-compute separation can give you an order of magnitude more performance for the same money than a separated system can. If someone could solve the fragility and hassle of it, it would be quite a winner.
> The separation of storage and compute is controversial
I don't think it's a controversial topic, it's an emerging trend. See for example this slide from the recent RocksDB meetup on their vision for the next 3 years of RocksDB's future:
Disaggregated storage is becoming quite a hot topic, and there are several technologies fueling this.
There are two major advances happening in disk I/O right now (that aren't to do with the async io_uring stuff):
1. NVMe Zoned Namespaces + the NVMe FDP spec, which are ways for userland applications to control the placement of writes on the drive (like a software flash translation layer)
2. NVMe-over-Fabrics and similar technologies (over RDMA/TCP), which enable access to a remote disk with 'comparable' latency to that of a local one.
E.g. if local disk access is 100us, AWS can give you 110us with NVMe-OF.
These two advances support the growing trend of disaggregated storage
---
Full credit to Arun George from the Samsung Memory group for explaining this to me.
The storage-over-fabric latency improvements allow it to work well for software that was not particularly optimized for storage performance, which has obvious benefits that I am not arguing against. There is a good reason storage-over-fabric has been a popular database deployment model for decades; it does have advantages, but this emerging trend is nothing new, and neither are its issues.
For systems that are highly optimized for storage performance, storage bandwidth tends to be more important than storage latency, and that is what is being sacrificed for the sake of disaggregating storage. I see integer-factor reductions in workload throughput when deployed on disaggregated storage versus the usual NVMe JBOD, and that factor tracks real-world storage bandwidth pretty well, particularly at high storage densities. At scale, these economics start to matter quite a lot.
To be clear, disaggregated storage has no effect on database design generally, seeing as it has been widely used for decades. It does put an artificial floor on achievable throughput for many workloads due to the bandwidth bottleneck. When people realize this, the pendulum swings the other way (again), so I wouldn't want to overfit for disaggregated storage.
No, I still stand by my opinion that I don't find disaggregated storage controversial
But if you're going to take an opinion on databases from someone, jandrewrogers is probably a good person to take it from. Me, on the other hand, I'm just a hobbyist/enthusiast; I've no accolades under my belt and my experience on the engineering end is shy of a year.
You and jandrewrogers are using different connotations for this word. Separating compute and storage makes sense when you have already done this like the hyperscalers have. It is good for them but not necessarily for the end user.
Whether it is surprising to you or not doesn't say anything about whether it is a good idea. If you make a statement weighing your own opinion against an expert's, why is that statement not referentially transparent?
When disaggregated storage gets predicate pushdown will it still be disaggregated?
There are hybrid designs for separation of storage and compute that are aimed at mixed workloads[1]. They avoid writes to remote storage on transaction commit (i.e., they act like a shared-nothing database for commits, but still push data asynchronously to a shared remote disk that can be used for scale-up/point-in-time restores/branching).
SingleStore is a kind of uninteresting example because it is not open source. That makes potential customers less confident. Why move large data volumes to SingleStore if the company could cease to exist within a year?
Yep, SingleStoreDB (formerly MemSQL) is not open source (probably never will be), but it does have many paying customers who have had workloads in production for over a decade at this point.
Also, isn't taking a bet on a very recently launched database-as-a-service built around an open-source database also pretty risky? Say, for example, the very recently launched ClickHouse Inc. service (which you're a co-founder of?).
Adam, you still synchronously replicate each log record to 2 places, right? Technically this should be roughly equivalent to writing through a consensus protocol. And you use a write-through cache for reads.
Yep, for writes, network bandwidth usage is independent of the separation of storage and compute in some sense. Any database that provides high availability is writing over the network somewhere before it acks a transaction as committed. It matters who is on the other end of that network write, though. Take the typical cloud DW (i.e., Snowflake) design of forcing writes to blob storage (shared remote disks) before commit. That is a much higher-latency write than what a high-performance transaction log replication protocol will do to replicate a write to another host.
> The resource mismatch currently leaves a lot of compute sitting idle in practice.
When the separate storage is multi-tenant and compute is elastic, you can really squeeze out more efficiency. With that you can keep utilization extremely high, because storage is effectively spread across the whole fleet. We really see this now, running 10K databases on a single (single per region, really) storage deployment.
Compute is also elastic, so you again can drive utilization up. So my argument is that separation of storage and compute is not necessarily for performance but rather for utilization, convenience to operate, and additional capabilities (branching, serverless).
Sure, for aggregating many small databases this makes sense. Disaggregation of storage and compute has been popular at many points in database history; the tradeoffs are well understood. Some databases are not particularly performance-sensitive, and it eliminates the resource waste from overhang when databases only need fractions of a server. For single large databases that occupy many servers, the economics tend to follow the size of the cluster, which can be significantly smaller with direct-attached storage. Scaling storage independently of compute introduces some interesting edge cases operationally because they are intrinsically coupled to some extent: scaling one creates resource pressures and bottlenecks elsewhere.
An alternative model to disaggregation that can produce a similar dynamic balancing of compute to storage without sacrificing bandwidth in scale-out systems is to use a dynamic mix of heterogeneous nodes. The storage is still tightly coupled to the compute but the ratio of compute to storage of the aggregate system can be quickly adjusted on the fly by adjusting the mix of server types. I haven’t seen a lot of work on heterogeneous cluster architectures in a long time — it was impractical when database clusters were capex — but it is a proven model albeit complex to implement, and easier to deploy today thanks to the cloud.
Many of the emerging data problems at the edge look like heterogeneous cluster architecture problems too if you squint, so it likely has value beyond the data center. The edge has a lot of surprisingly-shaped problems that don’t fit any of our current tools e.g. some of the most extreme scale-up database problems I’ve seen in any context.
Regarding separation of storage and compute, do those DBs you mention enable merging the application backend code and db compute node to a single process? Meaning I would deploy my application to the DB compute node and use its runtime, or have the compute node present as a component I can import to my backend code as a library. I think this is one powerful thing about Datomic (and maybe XTDB?).
I wouldn't say HTAP is an across the board trend. The need is mostly in the enterprise where we see high value mixed workloads.
Neon will take a Postgres-centric POV. I'm considering a few approaches, but not committing to any particular one at this point. The reason not to commit is that there is a lot more opportunity for Neon in serverless, devtools, and edge. And it makes a ton more sense to partner for OLAP workloads. So what does a Postgres-centric POV look like:
- Plugins like Timescale and Citus
- Seamless integration with sending CDC to Snowflake, Singlestore, Databricks, Clickhouse, and Motherduck
- FDW for integration with Snowflake, Singlestore, Clickhouse, Databricks, and Motherduck
- We looked at putting DuckDB on Neon storage with Jordan Tigani, but it was too big of a lift as DuckDB doesn't push all data through the WAL. Maybe in the future.
The reason we run an OLAP bench is that when you separate storage, full table scans get impacted: you fetch pages from remote storage one by one, and the mitigation is prefetch. So our goal is to be on par with vanilla Postgres. We don't care about winning OLAP benchmarks.
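To illustrate the mitigation, a toy sketch of the prefetch idea; `fetch_page` and `process` here are hypothetical stand-ins, not our actual code:

```cpp
// Keep a window of K page requests in flight so a sequential scan over remote
// storage never waits for a cold fetch.
#include <cstdint>
#include <deque>
#include <future>
#include <vector>

std::vector<uint8_t> fetch_page(uint64_t page_no) {
    // Stub: a real implementation would request this page from the remote storage tier.
    return std::vector<uint8_t>(8192, 0);
}
void process(const std::vector<uint8_t> &page) { /* consume the page */ }

void scan(uint64_t first_page, uint64_t last_page, int window = 8) {
    std::deque<std::future<std::vector<uint8_t>>> inflight;
    uint64_t next = first_page;
    auto issue = [&] {
        if (next <= last_page)
            inflight.push_back(std::async(std::launch::async, fetch_page, next++));
    };
    for (int i = 0; i < window; ++i) issue();    // prime the window
    while (!inflight.empty()) {
        auto page = inflight.front().get();      // only ever wait for the oldest request
        inflight.pop_front();
        issue();                                 // keep `window` requests in flight
        process(page);
    }
}
```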
ClickBench is still OK for comparing non-OLAP databases on OLAP workloads - filter out the column-oriented DBMSs, and you will be left with a comparison of the non-OLAP databases.
I think it makes sense to add NeonDB for the sake of completeness.
I think lots and lots of benchmarks are useful internally. They highlight where the soft spots are. This is however very different from publishing benchmark results.
As a DBaaS vendor you should publish benchmarks that your user base really cares about, which demonstrate that the system is mature and can handle the core workload your users expect you to handle. That is why publishing OLAP benchmark results for an OLTP system makes very, very little sense.
(BTW OLAP systems like Clickhouse should really publish TPC-DS SF1000+ results)
For OLTP systems there is TPC-C, but it's actually not that telling as it is so simple. I'm starting to see that people in OLTP care more and more about latencies, hence all the edge-related work and optimization of the drivers.
So Neon will be publishing end-to-end latency benchmarks and also making sure it is on par with vanilla Postgres on a broad set of benchmarks. Those we will run internally, and the results will simply be commits on our GitHub.
(just want to say i am extremely impressed by your answers in this thread - very precise and clear)
> there is a lot more opportunity for Neon in serverless, devtools, and edge
serverless i get, but am not sure about the other two. i guess my conception of database business is that it is a volume biz, ie mainly the terabyte-petabyte scale workloads are the valuable enterprise business. high value mixed workloads are a subset of that but i’d imagine “postgres++” would be a good value proposition as well (easy migration with no replatforming).
my impression is that devtools/edge would be relatively low volume. perhaps high margin but not enough to make up for low volume. do i have a misconception here?
- Tier 1: unique large-scale workloads: think moving money around. This is a ~$10-20Bln market mostly dominated by Oracle, Microsoft and IBM. All next-gen scale systems are kind of forced to play in this market. Deals are large but sales cycles are long.
- Fleets of tier-2 apps. Each enterprise now has a fleet of every major database offering to power lots of apps. To win this market you need to be a top-3 database in the enterprise. This market is driven by developers; SQL Server takes most of the $$, closely followed by AWS, but Postgres has a real shot here. This is Neon's bet: that after getting a wedge in the low-$$ SMB/hobbyist market, Neon takes the mid-market and then moves to the enterprise. The size of this market is also $10-20 Bln.
- OEMs. This is another $5-10 Bln. Neon only cares about DBaaS but is happy to be embedded in anything cloud.
Scylla also uses a shared-nothing sharding approach; they have an open-source "asynchronous I/O engine" toolkit called Seastar that is used inside ScyllaDB.
They have some good reading on this approach in the Seastar docs if you're curious, it's a great general introduction to Shared Nothing:
This is usually generalized to some variation of a "physical OS thread per CPU core" architecture, where each CPU core gets its own dedicated chunk of data. You use hashing or some other mechanism to route requests for data to the right core.
Each CPU core can have user-land concurrency or "virtual threads" (fibers/coroutines/etc) as well, so you're not limited to sequential code necessarily.
A combination of thread-per-core x userland-coroutines-per-thread is how modern query processing approaches like "Morsel-Driven Parallelism" work.
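A toy sketch of the routing part (mutex-based inboxes for brevity; a real engine like Seastar uses lock-free queues and pins each worker to a core):

```cpp
// Shared-nothing, thread-per-core routing: each worker owns one shard, and a
// request for a key is handed to the worker that owns hash(key) % n_shards.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct Shard {
    std::unordered_map<std::string, std::string> data;  // core-private: no locking on access
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void(Shard&)>> inbox;
};

class Engine {
    std::vector<Shard> shards_;
    std::vector<std::thread> workers_;
public:
    explicit Engine(size_t n) : shards_(n) {
        for (size_t i = 0; i < n; ++i)
            workers_.emplace_back([this, i] {
                Shard &s = shards_[i];
                for (;;) {
                    std::unique_lock<std::mutex> lk(s.m);
                    s.cv.wait(lk, [&] { return !s.inbox.empty(); });
                    auto task = std::move(s.inbox.front());
                    s.inbox.pop();
                    lk.unlock();
                    if (!task) return;                   // empty task = shutdown
                    task(s);                             // only this worker ever touches `s.data`
                }
            });
    }
    // Route the request to the owning core.
    void submit(const std::string &key, std::function<void(Shard&)> task) {
        Shard &s = shards_[std::hash<std::string>{}(key) % shards_.size()];
        { std::lock_guard<std::mutex> lk(s.m); s.inbox.push(std::move(task)); }
        s.cv.notify_one();
    }
    ~Engine() {
        for (auto &s : shards_) {
            { std::lock_guard<std::mutex> lk(s.m); s.inbox.push(nullptr); }
            s.cv.notify_one();
        }
        for (auto &t : workers_) t.join();
    }
};
```

A call like engine.submit(key, task) then runs on the core that owns hash(key), and userland coroutines/fibers can be layered inside each worker loop for the morsel-style parallelism mentioned above.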
I use a colon-separated syntax to indicate the multiplexing of the different hierarchy levels of threads/coroutines/lightweight threads.
1:M:N means 1 scheduler thread, M kernel threads and N lightweight threads. So 8 lightweight threads multiplexed over 2 kernel threads, multiplexed by one scheduler thread, looks like this:
1:0:0
1:0:1
1:0:2
1:0:3
1:1:0
1:1:1
1:1:2
1:1:3
You can use ring buffers to route data between threads; there are multi-producer multi-consumer ring buffers available. I also wrote what I call token ring parallelism, where there are no locks: threads take turns writing but read as fast as they can.
Essentially there is a state machine in which each thread is in either reading mode or writing mode. Each thread enables the writing mode of the next thread, (THREAD_ID + 1) % threads.size(). When the last thread is reached and finishes writing, reading mode for all threads is re-enabled.
You need some approach to representing data flow between threads that is efficient and pull/push driven; a rough sketch of the token-passing idea follows.
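A minimal sketch of the token-passing part (an illustration of the idea, not my exact state machine):

```cpp
// Threads read whenever they like, but only the thread currently holding the token
// writes; it then passes the token to (id + 1) % N.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> token{0};            // id of the thread currently allowed to write
std::atomic<long> shared_counter{0};  // stand-in for the shared output

void worker(int id, int n_threads, int rounds) {
    for (int r = 0; r < rounds; ++r) {
        // Reading mode: free-running reads, no coordination needed.
        long seen = shared_counter.load(std::memory_order_acquire);
        (void)seen;

        // Writing mode: wait for our turn, write, then pass the token on.
        while (token.load(std::memory_order_acquire) != id)
            std::this_thread::yield();
        shared_counter.fetch_add(1, std::memory_order_release);
        token.store((id + 1) % n_threads, std::memory_order_release);
    }
}

int main() {
    const int n = 4, rounds = 1000;
    std::vector<std::thread> ts;
    for (int i = 0; i < n; ++i) ts.emplace_back(worker, i, n, rounds);
    for (auto &t : ts) t.join();
    std::printf("counter = %ld\n", shared_counter.load());  // expect n * rounds
}
```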
Scylla is a C++ rewrite of Cassandra; its threading has been better as well, with thread-per-core. Cassandra 4.0 has some thread-per-core rearchitecture, so it remains to be seen whether Scylla keeps the performance advantages it used to tout.
I have not checked whether Scylla ever caught up to Cassandra 3.0's feature set; last I checked it had not. It may be "going its own way" on features as well, I believe (they can correct me).
My personal recommendation would be to start with Cassandra and, when operations allow it, move to Scylla if performance testing shows it is better. There is also exotic stuff like Rocksandra. Cassandra and Scylla are fundamentally similar in operations (add/remove nodes, scaling, etc.). The Scylla folks will contend that operations are easier since they get more perf per node, so you need fewer nodes.
It's a good idea with Cassandra/Scylla to bootstrap with a performance-testing, data-loading, provisioning loop. It gets you used to operations and to the tooling for backups, loads, load testing, and metrics.
I mentioned it recently[0], but this looks like a very good topic to plug our new database, Seafowl, that we released last year [1]. It also uses Apache DataFusion (like IOx) and separates storage and compute (like Neon, Snowflake etc) but is designed for client-side Web apps to run analytical SQL queries over HTTP (using semantics that make the query results cacheable by browser caches and CDNs). This makes it really useful for things like interactive visualizations or dashboards.
We're currently doing a lot of work at Splitgraph to reposition the product around this "analytics at the edge" use case, with usage-based billing, eventually moving our query execution from PostgreSQL to Seafowl.
You can use ClickHouse as a stateless query engine to process externally hosted datasets.
This way you just publish a MergeTree table to an S3 bucket, or a set of static files on an HTTP server, and then plug it into an empty ClickHouse server.
My takes for where databases are going in 2023 and beyond:
Separation of storage and compute is clearly the right direction for cloud databases. Pay for what you use. A cache between object storage and compute will paper over the differences and enable supporting workloads that match yesterday's and today's typical application architectures.
Still waiting for a database architecture that embraces the reality of distributed systems / concurrent mutation and exposes the right set of tools to reliably hit the theoretical minimum tradeoffs across a wide range of applications. CRDTs, for example, are making progress here but have not been generalized enough yet.
At some point transactional workload volume will grow to what we call "big data" today, and we'll be forced to use more storage-efficient columnar data representations for OLTP. I'm curious to see where 'build your own database kit' projects like DataFusion and efficient dataframe tools like polars/arrow2 can take us.
> At some point transactional workload volume will grow to what we call "big data" today, and we'll be forced to use more storage-efficient columnar data representations for OLTP.
I wonder.
I built some proof of concept storage using row representations with a novel compression scheme designed to compete with compressed columnar format for space while being fast.
The result is very storage efficient compared with competing alternatives that I started from.
It has fewer writes per stored row than columnar storage if you don't have large batches of rows to write (often the case with OLTP), and fewer reads per full row when doing lots of random access point queries (sometimes the case with OLTP and certainly the case for the application I did the tests for). I didn't compare with a compressed columnar representation, but in theory (there was some compression theory) it should take same or less space in many cases as well as using less I/O.
(It also was neither a B-tree nor an LSM-tree but had some of the desirable performance characteristics of both; that's another story. Not a fractal tree / B^ε tree either, as people sometimes ask.)
Presumably, well-funded groups are out there doing a much better job than I did in my spare time, so I wouldn't be surprised if "big data" OLTP doesn't end up requiring columnar representations after all, or if a scheme such as the one I used ends up competing with columnar formats in the end for space and performance reasons.
I've also been thinking about future data representations; in particular, some blend of LSM and fractal trees combined with a blend of a generalized hierarchical prefix encoding and delta encoding. I'd be interested to see what you've got.
> I built some proof of concept storage using row representations with a novel compression scheme designed to compete with compressed columnar format. ... I didn't compare with a compressed columnar representation, but in theory .. it should take same or less space .. as well as using less I/O.
Some of your statements felt a bit discordant when placed together.
> Some of your statements felt a bit discordant when placed together.
Ah, thank you, I think I see what you mean.
I designed compression methods with field-aware entropy coding in row form while maintaining enough indexing structure (also compressed) for logarithmic-time queries and updates. I did this while being aware of columnar form compression and wanting to replicate a similar size benefit when used on data that compresses particularly well in columnar form, especially things like low-cardinality fields (e.g. booleans and small enums) which compress very well when partially sorted, and data like time series and similar series on branching DAGs, which compress well with a delta compressor or other predictor.
Using some concepts from compression theory, I convinced myself that an optimal structure of this type in row form is size-equivalent (up to rounding errors) to an optimal encoding of the data in columnar form, and that it degrades in a similar way when used with lower-quality probability models. Notably, with suitable probability models, run-length encoding, numeric delta coding (such as for time series), string delta coding, and low-cardinality values in compressed columnar form all appear to be roughly matched in theory.
At the same time, the hidden keys stored in columnar form to connect columns together are not required, which saves space in the row form, but the new indexing structures to ensure logarithmic-time queries take up space instead (optionally, if you want the faster lookups), with different bounds, so these are somewhat balanced, depending on how hidden keys are represented.
So I did some comparisons at a theoretical, more abstract level between row and column forms, when considering columnar compression for different types of data and designing the rows and indexing structure with that in mind.
However, I didn't compare what I implemented with a real implementation of compressed column form of the same data, to see if the theory worked as well in practice as it did in theory.
What I did do instead is try out parts of it on some existing large databases, in the terabytes range to confirm how well various approaches compressed the space required compared with other implementations that also used row structure.
I knew that the databases the POC was designed for were hitting I/O bottlenecks before CPU bottlenecks in some phases of data usage, though not others, in other implementations, so I had some leeway to use more CPU to spend on compression to reduce streaming I/O, and motivation to reduce numbers of random access I/O by using an appropriately clustered representation, i.e. row form for most fields, as well as an appropriate inter-block tree structure. In the process I found some ways that intra-block field compression and inter-block merge+lookup structures complement each other, each helping the other to perform better.
Between all the tests and theorising I reached the view "should take same or less space... as well as using less I/O" (for the applications being targeted in particular), but I haven't actually implemented a columnar equivalent or compared with a columnar implementation to confirm this view. It's something I would like to do in due course, though using a configurable hybrid of row/column rather than fully one or the other. This is work in progress but spare time has been elusive (it's a side project, and I've spent recent months on another side project, and the "day" job).
RDBMSs haven't evolved much in a very long time, only making minor incremental improvements, and consumers still have to turn to complex queries because the principles never change. It's time we got something more modern.
- It has a web UI to handle data.
- You can write logic in JS. (The more logic in the DB, the better.)
You can put it in git by writing the defined functions in a file.
You may use DB migration tools to run those.
It's better because logic in the DB cannot be bypassed. If you write logic or a check in, say, PHP, and part of your app runs in NodeJS, you have to duplicate those checks and logic. It even works when you interact with the database manually with some GUI tool, which prevents you from manually inserting bad data.
With that attitude we would stick to assembly forever, though. I don't know the exact requirements, but I have a hard time imagining that JavaScript is a great choice, except for how many people know it.
Surreal looks good; I checked it a few times. I was wondering: if the Rust is compiled to WASM, and given that Surreal supports offline/online data sync, could it be used in the browser and app as an offline solution, like Pouch/Couch? That's what most solutions are missing for the projects we do.
I am writing a Python based SQL query engine: https://github.com/marsupialtail/quokka. My personal goal is to get Andy Pavlo to mention it in his year-end blogs.
I agree with many of the points made in the blog by Andy. Writing a distributed database has become way easier due to open source components like Ray, Arrow, Velox, DuckDB, SQLGlot etc.
I personally believe we will see a switch from JVM-based technologies to Rust/C-based ones with Python wrappers.
While the author seems to be an expert in production or web scale databases, I feel like this was a big year for SQLite also! Many people including myself are looking at it for server side use cases.
I'm the founder of RisingWave Labs. Glad to know that our company was mentioned in the blog.
Technology-wise, I agree with Nikita that storage-compute decoupling is the trend. Serverless will also be the main focus.
Market wise, I observed several trends:
* streaming. Streamlining the data pipeline is a new trend. RisingWave and Materialize are both addressing this market. They will probably need to figure out how to compete against established solutions like ksqlDB and Apache Flink.
* real-time. There are probably 20+ vendors selling real-time OLAP databases. They all claim that they are cheaper, faster, better than Snowflake/Redshift. Just name a few: Druid, Pinot, ClickHouse, StarRocks etc. Let's see who will be the winner in 2023.
* small data. The biggest winner in this domain is Motherduck, a startup commercializing DuckDB. Redshift/Snowflake are overkill for most companies that do not own large amounts of data. Motherduck may be the right choice.
* hybrid. Several companies have started selling hybrid databases (mainly HTAP for now): Google AlloyDB, Snowflake Unistore, SingleStore, PingCAP TiDB+TiFlash, etc.
In 2023, we will also watch how Databricks challenges Snowflake in the data warehouse domain.
> * real-time. There are probably 20+ vendors selling real-time OLAP databases. They all claim that they are cheaper, faster, better than Snowflake/Redshift. Just name a few: Druid, Pinot, ClickHouse, StarRocks etc. Let's see who will be the winner in 2023.
These are faster but they are also very limited. Most of these have very limited support for joins, can't do proper SQL (e.g. no CTEs in ClickHouse) and require a lot of careful design for your data (e.g. StarRocks can do smart pre-aggregation but you have to tell it which columns are dimensions and which are values).
That is fine for limited use cases - like "dedicated engineering team is building specific application" - but doesn't really work for general purpose data platforms.
Snowflake I think is successful because it is pretty good at handling everything you throw at it, even if it costs a lot of money.
ClickHouse has joins and CTEs just fine. It is, however, a column-oriented database, and joins are generally discouraged in favor of denormalized wide tables. That said, it has extremely fast lookup mechanisms via joinGet() or dictGet().
Most SQL Analytical databases don't discourage joins (think BigQuery, Redshift, Snowflake, etc.) and all of them are columnstores. I think discouraging joins is something very specific to Clickhouse, Druid, Pinot and others that have very limited support for joins.
ClickHouse at least supports local joins quite well.
Perhaps another way to put it is that BigQuery, Redshift, and Snowflake are not optimized for real-time response on large, wide tables. ClickHouse has features that allow it to pack multiple entities in a single table, then pull data out in a single scan. This includes tricks like simulating joins using aggregation. [0] It's a great design for feeding tenant dashboards with fixed latency (say 2s or less) and predictable cost. This use case is shared by many SaaS offerings.
Over time I think the differences will become less as current database engines converge on commonly required features. ClickHouse join types have expanded over the last year and features like join reordering are in the 2023 roadmap. [1] Conversely incumbent cloud databases are adding features to support real-time analytics.
I work on ClickHouse at Altinity and can't speak for Druid and Pinot. Perhaps someone else with detailed knowledge can chip in.
That is interesting, because dbt-clickhouse claims it doesn't do ephemeral models (i.e. CTE includes) on purpose because ClickHouse doesn't support them. But I'm looking at the changelog and CTEs have been there since 2020. Hmmm.
I guess it has more to do with the lack of an optimizer that can do more advanced things like join reordering, pushdown or subexpression reuse (though even Spark and Trino don't have that last one).
I'm curious as to where you see the claim that dbt-clickhouse doesn't support ephemeral models because ClickHouse doesn't support CTEs. To be honest I don't know the historical reason why dbt-clickhouse doesn't support ephemeral models (ClickHouse Inc. took over this community project early last year). In any case I've opened a new issue to add support. https://github.com/ClickHouse/dbt-clickhouse/issues/126
Druid is indeed a bit faster than Snowflake, but a previous team I worked on was able to achieve similar latencies under heavy load using some clever techniques with materialized views.
Happy to find this blog; I wish there was more talk about NewSQL databases. I've personally seen how much cost goes into migrating off of traditional RDBMSs to sharded RDBMSs, and I was wondering why people don't go with these scalable systems from the start now that they exist.
+1. Recently, Yugabyte has seemed intriguing to me. (It seems like) You can launch it as a standalone instance and use it just like Postgres early in your company, and then when you need to distribute, you can do that.
I found that having a NoSQL distributed database for prod that streams Change Data Capture to a read-only SQL DB is a great compromise. The SQL data was slightly stale, but that wasn't an issue for most situations (like graphing data or certain error metrics).
Probably because there is a lot more operational understanding of "traditional" RDBMSs and they're sufficient in the early stages. Once your product takes off it can be difficult to scale, but by that time you probably have enough money to attract engineers who can do the migration.
Yes, that's the common refrain, just migrate when you hit scale. Just pay people to do the migration. My experience at multiple, billion dollar companies has been that it's never that easy.
People become afraid of doing schema migrations and tools like dynamodb become the default datastore choice for all new feature development.
Every single startup I've worked at, technical cruft that existed from company founding never disappears despite the founders intent that it was never supposed to stay that way and it slows down feature development to a crawl.
If NewSQL databases exist that support 90% of the features of postgres but give support for cross shard transactions then I'd rather use them than technology that has been proven to not scale. Unfortunately I'm not an expert on distributed databases and there is not much written about these databases in practice outside of the startups creating the databases themselves.
From my own experimentation, CockroachDB is fantastic and exactly what you just described (90%+ of the features of postgres, but transparently sharded). It is somehow new and shiny and boring old tech all at the same time.
In fact, a large part of the work in many emerging database systems, including Doris, is filling in the functionality that is missing compared with traditional databases (like Oracle), so as to help users smoothly migrate from the original system to the new one - such as the CTEs mentioned by zX41ZdbW.
At the same time, most projects emphasize their performance and cost-effectiveness, because this is the first factor in user selection. For example, Doris and SelectDB (Doris' commercial company) also participated in ClickBench to show their performance advantages.
But the implementations of the various projects are actually converging (Doris' execution engine also draws on ClickHouse - thanks for open-sourcing it), so we can only look for advantages in specific scenarios, such as high concurrency, time series, etc.
On the other hand, performance and cost-effectiveness do not really solve users' problems, so we are also thinking about what users really need besides technology.
Another interesting point is the trend of "de-distribution". ClickHouse and DuckDB, for example, can run well on a single multi-core machine and can meet the needs of many users. The reliability, availability, and scalability problems solved by "distribution" can all be solved by the PaaS offerings of cloud vendors.
Are there any good recommendations for a lecture or series of lectures that go over Databases from a theoretical perspective but also do a practical survey of the various DB systems and technologies out there?
Oh great, it seems like this year he made the sarcasm about Larry Ellison a lot easier to pick up. Frankly I was still unsure how he really felt about Larry after his review from last year[1]. After reading this, it confirms to me that he doesn’t like Larry :).
Andy's obsession with mmap is interesting. I don't think it's the worst idea in some cases, like Prometheus. They use mmap to simplify the code and only keep the relevant (hot) parts of the index/data in memory. I guess the fact that the data is immutable helps, but that's the point; it's a good idea for some use cases and I don't agree with a blanket "mmap bad"
We used mmap extensively, and it took a gigantic effort to purge it from the codebase in SingleStore. Lots of issues; the biggest one is that munmap takes a mutex and you incur a lot of contention.
The second one is that mmap prevents you from tightly controlling memory in various pipelining scenarios where you want your memory overhead to be constant and not sizeof(data).
> Lots of issues; the biggest one is that munmap takes a mutex and you incur a lot of contention.
But the alternative we have suffers from the same issue, doesn't it?
malloc is implemented in terms of sbrk and mmap/munmap, both of which are guarded by locks. And malloc will also have its own user-space locks on top of the ones from the kernel. I checked the glibc and jemalloc implementations.
> mmap prevents you from tightly controlling memory ... where you want your memory overhead to be constant and not sizeof(data).
I am not sure I understand this argument. It's trivial to implement a custom user-space mmap-based allocator that gives you no less control over memory than other "types" of allocators. Actually, it can give you more power, because you can avoid the locks from user-space malloc implementations if you know that the use case you're crafting it for is single-threaded. I used this technique in the past for short-lived objects, and in certain workloads it improved the performance by a significant factor.
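For example, a minimal single-threaded bump arena over one anonymous mmap (sizes arbitrary, just to illustrate the technique):

```cpp
// A lock-free-by-construction arena: one anonymous mapping, a bump pointer, and a
// wholesale reset/munmap instead of per-object frees.
#include <sys/mman.h>
#include <cstddef>

class MmapArena {
    char *base_ = nullptr;
    size_t cap_ = 0, used_ = 0;
public:
    explicit MmapArena(size_t cap) : cap_(cap) {
        void *p = mmap(nullptr, cap, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        base_ = (p == MAP_FAILED) ? nullptr : static_cast<char *>(p);
    }
    void *alloc(size_t n, size_t align = 16) {
        size_t off = (used_ + align - 1) & ~(align - 1);   // bump with alignment
        if (!base_ || off + n > cap_) return nullptr;
        used_ = off + n;
        return base_ + off;
    }
    void reset() { used_ = 0; }   // reuse the arena for the next batch of short-lived objects
    ~MmapArena() { if (base_) munmap(base_, cap_); }
};

// Usage: MmapArena arena(1 << 20); void *p = arena.alloc(256); ... arena.reset();
```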
When Andy rages about databases using mmap, he means using it to replace the buffer pool. In theory, it provides exactly what the database needs by paging data in and out on demand. In practice this scales poorly to a large number of threads due to TLB shootdowns, and, as Nikita said, you give up control of memory usage (since it's the OS managing it).
Nobody is arguing that you shouldn’t call mmap to get yourself a few anonymous virtual pages to do your work in.
I am probably missing something, but I still don't see the difference, given that the buffer pool will otherwise be implemented in terms of malloc, and malloc is still implemented in terms of mmap, so we're back to square one.
However, I could imagine that contention on the kernel-space mmap locks could be artificially relieved because of the user-space locks in malloc.
Currently I don't see how TLB shootdowns would be specific to mmap but not to malloc, but I may find some time to read the paper.
What we’re talking about is mmap-ing files vs doing some anonymous mapping and manually reading files. Anonymous mappings don’t incur all the TLB shootdowns that mmap’d files do.
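The distinction in a few lines (illustrative only; `table.dat` is a hypothetical file):

```cpp
// Option A maps the file itself, so the kernel pages data in and out behind the
// DB's back, and evicting those mappings later can trigger TLB shootdowns across
// threads. Option B maps anonymous memory once (the buffer pool) and fills it with
// explicit reads that the DB schedules itself.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const size_t len = 1 << 20;
    int fd = open("table.dat", O_RDONLY);

    // Option A: file-backed mapping.
    void *a = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);

    // Option B: anonymous mapping owned by the DB, filled by explicit reads.
    void *b = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    pread(fd, b, len, 0);   // the DB decides what lives here, and when

    munmap(a, len);
    munmap(b, len);
    close(fd);
}
```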
I came back to MariaDB after leaving MySQL about a decade ago.
The JDBC driver is 10x as big and times out after less than 12 hours by default.
Things are not going in the right direction, but on the upside I have now learned how to install MariaDB manually, so installers can't brick my OS like MySQL did back in 2014 (MySQL bricked itself with no way to unbrick it except reinstalling Windows).
> Things are not going in the right direction, but on the upside I have now learned how to install MariaDB manually, so installers can't brick my OS like MySQL did back in 2014 (MySQL bricked itself with no way to unbrick it except reinstalling Windows).
You know, a container runtime like Docker is handy if you ever want to be able to create various throwaway environments with resource limits or port mappings easily, for any number of versions you might need to run in each of your projects.
This works similarly well for something like PostgreSQL, too! And if you need persistence, you can use either volumes or bind mounts.
What I found strange about MariaDB is that it is ~10 times slower than MySQL on OLAP queries[1]. It looks entirely wrong and most likely related to an issue with Debian packages.
I would, I even made my own ORM to be able to switch db's without friction, but the particular project I'm reviving was made before my ORM was ported to Postgres and Oracle.
Slightly off topic, would someone mind providing a down to earth summary of the benefits (or disadvantages) of dbt?
To ape the infamous iPod and Dropbox middlebrow dismissals: dbt looks like a control loop with a nice UI to make a database schema match templated SQL files.
More seriously, at what level of maturity would a company find dbt useful and which teams benefit the most?
I think it makes sense in the context of the potential of plain old SQL (and additions such as dbt) winning out vs Python or other general-purpose programming solutions as the analytics workbench (although this is controversial; there's a school of thought that backs it).
> Too much blood has been shed in the database community in previous benchmark turf wars ... I even broke up with a girlfriend once because of sloppy benchmark results.
Well, better over that than over east const vs const west.
What are people's and the market take on Starburst? I haven't kept up with databases and the general market, but recently heard about them. Is there a sizeable demand and market for them? Any other big players to watch for?
Great post, and thanks for the call, although I am skeptical of the commoditization of OLAP systems via the likes of Substrait and Velox. It'll take a lot of unwilling participants.
Having an independent schema file that is one-to-one with the proto files is not quite as nice, but then you can still get bindings in each language to the proto.
The goal is to avoid defining the table schema in each language, like ORMs, where things can be out of date.
There may even be some benefits to this, as you might have some lifecycle hooks around migrations to keep the proto up to date with the current schema. You could do this with just protoc plugins to hide fields in the language bindings, or you could do it during the edgedb-schema-to-proto translation.
Love this idea, and in general agree that using a proto model as the single source of truth is a neat concept.
Not sure you need a completely new db though; with protobuf's native support for plugins, perhaps the missing piece you're looking for is a protoc plugin that generates the table definitions for your db of choice? Could work well for databases with declarative schemas.
Don’t want to discourage you from creating a new database, but writing a proto plugin is arguably a quicker undertaking!
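For example, a rough sketch of the kind of mapping such a plugin could emit, using the protobuf descriptor API; the proto-to-SQL type mapping, nullability, and key handling here are simplifying assumptions, not a finished design:

```cpp
// Walk a message's descriptor and emit a CREATE TABLE statement for it.
#include <google/protobuf/descriptor.h>
#include <string>

std::string SqlType(const google::protobuf::FieldDescriptor *f) {
    using FD = google::protobuf::FieldDescriptor;
    switch (f->cpp_type()) {
        case FD::CPPTYPE_INT32:
        case FD::CPPTYPE_INT64:  return "BIGINT";
        case FD::CPPTYPE_DOUBLE:
        case FD::CPPTYPE_FLOAT:  return "DOUBLE PRECISION";
        case FD::CPPTYPE_BOOL:   return "BOOLEAN";
        case FD::CPPTYPE_STRING: return "TEXT";
        default:                 return "JSONB";   // nested messages, enums, etc. (assumption)
    }
}

std::string CreateTableFor(const google::protobuf::Descriptor *msg) {
    std::string sql = "CREATE TABLE " + std::string(msg->name()) + " (\n";
    for (int i = 0; i < msg->field_count(); ++i) {
        const auto *f = msg->field(i);
        sql += "  " + std::string(f->name()) + " " + SqlType(f);
        if (i + 1 < msg->field_count()) sql += ",";
        sql += "\n";
    }
    return sql + ");\n";
}

// Usage: CreateTableFor(MyMessage::descriptor()) for any generated message type.
```

A real protoc plugin would do essentially the same walk over the descriptors it receives from protoc and write out the DDL (and, ideally, migrations) as generated files.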
What I'm unsure of is how well proto can be modeled in various existing relational databases and what the performance and management consequences might be.
If I have a message defining a table and it has a field which is another message, how well can Postgres structs represent that submessage? What are the performance implications? I'm not terribly familiar with the query semantics for Postgres structs. How do migrations and backfills work for struct fields and nested structs?
These considerations may or may not justify a new database.
Yeah I’m sure there are all sorts of considerations. Also not saying it’s straightforward, just easier than writing a new db. You can for example write proto message extensions to support custom field behaviours.
But like a sibling comment said here, if you want the database implementation to dictate your models, you can probably generate protos from something like EdgeDB. But then the db implementation is in charge, which seems a bit backwards.
If you’ve used graphql you’re presented with a similar challenge but on the other end of the spectrum - the graphql schema that needs to map to application models.
So one of the main considerations is to figure out how and where you want to define your source models. Is it the view layer, the model layer or the database layer.
Can we also start a general "Andy Pavlo" appreciation comment chain?
Please comment below if Andy has in any way positively improved your life or helped you further your education/career.
I can start:
- Early in 2022, I decided I wanted to learn how databases like Postgres work and build one from scratch. If it wasn't for the CMU lectures available to the public on YouTube, I don't think I would have been able to give myself anywhere near a decent education.
- This year, Andy even made the homework-grading platform available to the public, so you could submit solutions to the database coursework and get graded as if you were a student attending CMU.
- Not only does Andy take the time to record, edit, and upload his lectures to the public, but he hosts a yearly seminar series where he interviews speakers from all over the database industry (https://db.cs.cmu.edu/seminar2022)
- Andy does all he does (Professorship) on top of running OtterTune, I have no idea how he functions. He truly is the realest of all gangsters, and a secret member of the Wu-Tang Clan. Andy makes databases fun.
Can you point me to the homework-grading platform? Can one use it outside of the standard course schedule? It would be great to do the tasks at my own pace.
Also, I should mention there is a very large (unofficial) Discord server (+1,000 people) where some of the TA's from this DB course and myself hang out if anyone is interested:
And if you find database engineering interesting in general, I might also invite you to come hang out in Phil Eaton's hacker-space Discord, which is quite good:
Seems that I can't access it without a Gradescope account. I can't create such an account because I'm not a student anymore. What a pity, I would love to do those assignments and see the results.
Which part are you referring to? I can't get through his highly opinionated and dismissive writing. Plus he refers to his daughter as his "biological daughter" which is super awkward and weird. I keep getting distracted by it. It comes across like it's supposed to be a passive-aggressive comment about his step-daughter or adopted daughter.
If you clicked on the “biological daughter” link, you’d see it links out to a Twitter post where Pavlo shares a paternity test he had to take for some reason. So it’s just a joke, not a commentary on other children he may or may not have.
Thanks for the context. Sounds like he's being passive-aggressive towards the parties involved that required the test for some reason. Now it's not just weird, but comes across as oddly petty for someone that is an adult that's teaching a bunch of younger people. Not sure what it is, but the way the guy is operating is a big turn off in a lot of ways.
https://supabase.com/
pg_graphql - https://supabase.com/blog/pg-graphql-v1
pg_jsonschema - https://supabase.com/blog/pg-jsonschema-a-postgres-extension...
Postgres WASM - https://supabase.com/blog/postgres-wasm
...And other open-source features.