> One path for fixing this would be to move the SQL query parser inside the client library. Then internally, have the client library send optimized RPC instructions to the database. This would take CPU load off the database (usually a good move - databases are harder to scale). And it would open the door for the client library to provide other better, lighter and faster ways to programmatically construct database queries. Ideally without the totally unnecessary loop of constructing and then parsing SQL.
Erm, that’s already true. Proper clients will be parsing SQL locally and sending representations of the query and its parameters separately. They may often be using a wire format that’s very close to textual SQL, but parsing that SQL is hardly the slow part of executing a query.
DB CPU load doesn’t come from parsing SQL; it comes from the query planning process and the actual work of reading and filtering data. The planning process alone is extremely complex: the DB uses all manner of statistical data collected about the type of data stored, exactly how it’s stored, and how its various columns do or don’t correlate with each other, in order to construct the best possible query plan. That process also factors in additional elements like the relative speeds of reading from disk vs. memory vs. CPU cache, and how that varies with working set size.
In addition, the real CPU load comes from the actual execution of that query. Streaming data off disks into memory, building various temporary in-memory data structures to accelerate joins and filters - there are a lot of bytes that need to be shuffled around, and modern DB codebases are optimised down to the number of CPU cycles needed in their tight inner loops. None of this work can be moved to the client, not unless you have a mechanism for streaming GB/s of data to your client on every query.
To think of SQL as just an RPC protocol completely misses the point of SQL, and the incredible engineering involved in modern query planners. There are many reasons to replace SQL with something better, but the idea that it’s an inefficient RPC protocol, and that clients should have more fine-grained control of the planning and execution of queries, really isn’t one of them.
> To think of SQL as just an RPC protocol completely misses the point of SQL, and the incredible engineering involved in modern query planners. There are many reasons to replace SQL with something better, but the idea that it’s an inefficient RPC protocol, and that clients should have more fine-grained control of the planning and execution of queries, really isn’t one of them.
But SQL is used as an RPC protocol & wire format for getting data in and out of the database. Basically all web apps built on top of a SQL database craft SQL strings for SELECT and UPDATE commands. The queries are very repetitive - they're crafted once and run with each web request. I'm sure there's a lot of unnecessary work in doing that over and over again constantly - and if that work isn't parsing strings, it's generating the query plan. The format is also really awkward to use. SQL doesn't play very nicely with most modern languages: you generally either write the SQL string by hand, and then the types are opaque to the language, or you get an ORM to do it, and then you can't easily customize the SQL string itself.
It's complicated, but one approach to solving this would be to bundle up all that query planning code into a wasm blob and send it to database clients when they connect. The blob would contain any complex statistical code that the database uses. This would allow applications to create and save a query plan from a query string, and reuse it with subsequent queries. The wasm bundle would also contain all the binary serialization & deserialization code. Databases could then continue to evolve their query planners as they do today - but performance would be much better. And it would open the door to richer, lower-level APIs for applications to customize & understand the query plans. And, I'm sure, to figure out what types will be expected and returned from any given query.
> This would allow applications to create and save a query plan from a query string, and reuse it with subsequent queries.
You can already do this. They’re called prepared queries. The application sends the query once, and asks the DB to create a prepared query, and every time the application wants to reuse it, it just needs to pass the query parameters. The DB reuses the already existing prepared query, and often the query plan is reused as well. But that depends on the exact DB implementation.
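To make the prepare-once, execute-many pattern concrete, here's a minimal sketch using Python's built-in sqlite3 module as a stand-in for a server database (the table and data here are made up for the example; against PostgreSQL the same idea is the PREPARE/EXECUTE or parse/bind steps of the wire protocol):

```python
import sqlite3

# In-memory DB as a stand-in for a real server; the principle is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# The query text is fixed; only the bound parameters change per call.
# sqlite3 compiles the statement once and reuses it from an internal
# statement cache, analogous to PREPARE/EXECUTE against a server DB.
query = "SELECT name FROM users WHERE id = ?"
names = [conn.execute(query, (uid,)).fetchone()[0] for uid in (1, 2)]
print(names)
```

The application never re-crafts or re-parses the SQL string per request; it only sends new parameter values.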
You also can’t just reuse the same query plan over and over again. The whole point of the query planning process is to optimise the query plan for the current state of the data on disk. As that data is appended to and updated, the optimal query plan may change.
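You can watch a plan change as the database's state changes even in an embedded engine. A sketch using Python's sqlite3 (here the trigger is a new index; statistics gathered by ANALYZE can shift plans similarly, and the exact plan text varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER, v TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

before = plan("SELECT v FROM t WHERE k = 1")  # full table scan
conn.execute("CREATE INDEX t_k ON t (k)")
after = plan("SELECT v FROM t WHERE k = 1")   # now an index lookup
print(before)
print(after)
```

The same query string, planned twice, yields two different plans - which is exactly why a plan saved on the client would go stale.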
It’s also possible for applications to know exactly what types a query is going to return. That information is already returned by the database as part of the query response; it has to be, for the client to correctly deserialise the returned data. Most DBs use a binary protocol for most of their clients (though many also support a pure text-based protocol), and deserialising a binary response requires information about the response types.
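In Python's DB-API, for instance, that per-column metadata surfaces as `cursor.description`. A small sketch with sqlite3 (which only populates the column-name field; server protocols such as PostgreSQL's additionally return a type identifier per column):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.execute("SELECT 1 AS id, 'x' AS label")

# cursor.description carries per-column metadata sent back with the
# response, available before any rows are fetched.
cols = [d[0] for d in cur.description]
print(cols)
```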
I think you’re drastically underestimating just how complex and magical the inside of a real SQL database is. They represent literally decades of engineering and research into the best ways of storing and querying databases. If anything close to what you’re suggesting was a good idea, someone would have tried it.
> I think you’re drastically underestimating just how complex and magical the inside of a real SQL database is. They represent literally decades of engineering and research into the best ways of storing and querying databases. If anything close to what you’re suggesting was a good idea, someone would have tried it.
I was with you up until the last sentence. I almost certainly am underestimating how magical the internals of databases are, and how much work has gone into them. But there is a mountain of good ideas in CS that simply haven't been tried because of convenience and momentum.
For example, SeL4's microkernel architecture is brilliant - but it'll probably never make it into Linux because of simple inertia. (It's too hard to rearchitect Linux like that, and nobody really wants to port all their working software to L4.)
One reason that this approach to database design probably hasn't been tried before is that there are dozens of database client driver implementations. Moving any logic into the DB client would require all of that logic to be reimplemented many times over, and that's a lot of inconvenient work. But webassembly changes that calculus - since the database itself can hand a wasm blob to the client to run. That's a very new capability that was really hard to pull off a few years ago. (FoundationDB does something similar - the client library is a native library that needs bindings for every language / platform - but it's incredibly inconvenient to use as a result. And a native Apple Silicon version of the library took years to come out, even though it's maintained by Apple.)
> One reason that this approach to database design probably hasn't been tried before is that there's dozens of database client driver implementations.
Or it could be that the added complexity simply doesn’t provide any benefits. What benefits would splitting up the query planner between the DB and its clients provide? Computing a query plan isn’t the expensive part of executing a query.
You’re suggesting that it’s a good idea for databases to expose their core internal APIs, which are tightly coupled to their underlying storage subsystems, to some kind of magic query planner that runs in the client. Basically ensuring that you can’t ever modify those APIs without breaking every existing client out there. Effectively you would force every DB to require their clients to be updated in perfect lock-step with the DB itself. All to achieve what? A saving of a few milliseconds on un-prepared queries?
Simply put, there are much more sensible ways of reducing the cost of query parsing and planning. As it happens, every DB worth talking about already implements those optimisations in the form of prepared queries. What benefit would moving where the query planner is executed provide?
I would also highlight that if you go and look at DBs built from the ground up for massive scale and parallelisation, they still run their query planners on a single node to figure out how best to break up a query for distributed execution. Nobody is attempting to distribute the actual query planning part of a DB.