sled is an embedded database that takes advantage of modern lock-free indexing and flash-friendly storage. rio is a pure-rust io_uring library unlocking the linux kernel's new asynchronous IO interface. This short talk will cover techniques that have been used to take advantage of modern hardware and kernels while optimizing for long term developer happiness in a complex, correctness-critical Rust codebase.
Good to see some more activity in the embedded data space. Earlier we used BerkeleyDB [1], mSQL [2] (from which MySQL was built), and Postgres 4.2 [3], and we have now settled on SQLite for all our embedded database needs. When we need performance, at present we just use SQLite's in-memory DB.
Will give sled a try when it reaches 1.0.0; at present SQLite serves our embedded requirements.
Make sure to check out LMDB if you haven't. It's not relational and has an extremely anemic API, but it makes a great low-level building block. sled is similar in that regard, very different from the others you have named.
My company uses LMDB extensively. LevelDB is an LSM database. LSM is generally better for write-heavy workloads than B-tree DBs like LMDB and sled. LMDB also has a single-writer restriction.
The limit of 16 threads under Windows was added in a commit named "refactor imports".
If somebody in the future tries to find out why the threads are limited to 16 under Windows, they will eventually end up at that commit, and if the committer isn't around any more (or doesn't remember), nobody will ever know the reasoning.
For projects that have even the slightest chance of staying around, be mindful about your commit messages and explain why you are doing something, not what you are doing - that's what the code itself is for.
I would also argue for a short comment in the code for things like this. Code gets shared and the version control doesn’t always come with it. I’m mostly bitten by this when using some vendor’s SDK where they release a tarball of the code but keep the repo private. But even within your own organization, code will get copy/pasted to another repo and the comment is more likely to go with it.
> be mindful about your commit messages and explain why you are doing something, not what you are doing
It's sad to see how many experienced programmers do not understand this. Of course, it takes a little more effort, but without this type of thinking you might as well skip the commit message completely. I am a very junior programmer and it was one of the first things I was corrected for not doing, and rightly so.
I've seen this kind of thing many times before: the developer wants to commit something (in this case, an import refactor), and accidentally commits other changes which were being worked on. This is why I don't recommend getting into the habit of using "git commit -a".
absolutely. I'm telling people to always use `git add -p` to help them produce self-contained commits.
And if you screw something up, there's always `git commit --amend` or the sledgehammer of `git rebase -i`.
Note: I'm not advocating doing this to public history, but I'm saying: Use these tools in order to create self-contained changes on your end before committing them to public history.
When I'm doing archeology trying to find out why Foo is doing Bar instead of Baz, I don't want to know that back then, 10 years ago, you fixed a typo or added a "forgotten file".
Which surprises me, because formal object ownership and read/write access sound like exactly the sorts of things you’d like to have for a NUMA machine.
That's still essentially single-digit percentages. And if it's really in the low double digits, you're back to single digits when you go to 80 cores.
On the positive side, I don't think multi-user systems would be where they are today if we hadn't been staring at a future full of hardware that can only be fully utilized by embarrassingly parallel tasks.
Ignoring the obvious fact that not everybody gets to choose the operating system their enterprise uses, there are real reasons to use Windows for performance.
For one, SQL Server spanks most other databases for complex ad-hoc queries.
Don't get me wrong, PostgreSQL is getting a lot better these days, and I'm sure that for a lot of basic workloads it's very competitive, but the query planner in SQL Server is basically black magic.
There are very few (if any) database engines out there that can, in a single query, combine in-memory tables, remote tables, functions, materialised views, columnar data, and recursive joins, then run that query in parallel and efficiently.
I think Oracle has most of those checkbox features, and perhaps a couple of others do too (DB2 perhaps?), but as far as I know MS SQL Server still outperforms them for this type of thing.
Mind you, Oracle scales to bigger clusters, and DB2 runs on bigger iron, but that's not my use-case. I have one Intel box.
> Ignoring the obvious that not everybody gets to choose the operating system their enterprise uses
For sure. If you can't choose, then I don't blame you.
> there's real reasons to use Windows for performance
I can't really think of any reason though. You gave the example of SQL Server, but that also runs on Linux. Does SQL Server perform better on Windows than on Linux?
I'd love to see someone go for an "exo-db" architecture, where the layers are broken out into distinct crates. The expressiveness that makes this possible should be a main benefit of Rust, on a par with safety.
> If you want to store numerical keys in a way that will play nicely with sled's iterators and ordered operations, please remember to store your numerical items in big-endian form. Little endian (the default of many things) will often appear to be doing the right thing until you start working with more than 256 items (more than 1 byte), causing lexicographic ordering of the serialized bytes to diverge from the lexicographic ordering of their deserialized numerical form.
While I can understand that big endian ordering is convenient for size-indifferent lexical ordering, is the tradeoff worth it? Yes, your benchmarks will look better, but real-world usage will require clients to continually byte swap. So you've basically forced a performance drop into user code. Also, wouldn't size-aware compares perform better in the generated machine code anyway?
Swapping 8 bytes around is a minuscule cost compared to the cost of having a badly formed index. There really is no way around the data being big-endian.
yeah, le-be conversions are not generally measurable above noise compared to other database-related work.
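To make the divergence concrete, here is a minimal sketch in plain Rust (no sled dependency; `to_be_bytes`/`to_le_bytes` are in std) showing that byte-wise lexicographic ordering matches numeric ordering for big-endian, but diverges for little-endian once values cross 255:

```rust
// Why the sled docs recommend big-endian keys: lexicographic ordering
// of the serialized bytes must match numeric ordering of the values.
fn main() {
    let a: u64 = 255;
    let b: u64 = 256;

    // Big-endian: 255 -> [0,...,0,255], 256 -> [0,...,1,0].
    // Byte-wise comparison agrees with numeric comparison.
    assert!(a.to_be_bytes() < b.to_be_bytes());

    // Little-endian: 255 -> [255,0,...], 256 -> [0,1,0,...].
    // 256 now sorts *before* 255 lexicographically.
    assert!(a.to_le_bytes() > b.to_le_bytes());

    println!("big-endian preserves ordering; little-endian does not");
}
```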
sled stores arbitrary bytes. endianness is the concern of the person who wants to store higher-level types than bytes, like integers. I do imagine having a story for letting people deserialize/view bytes once, and having that view sit in cache so that hot items are not repeatedly deserialized.
I agree with Adya's work "Fast key-value stores: An idea whose time has come and gone" (https://research.google/pubs/pub48030/), where it makes the case that stateful systems should not have to pay repeated deserialization and network costs. I want sled to be well-situated for the world that we're headed into, where this view will become more prominent. This means caching deserialized data, better replication stories, and maybe nice helpers for distributed database authors that allow for atomic tree splits / merges and per-tree replication settings.
On sled 0.31 the memory growth does not run away as it did with the earlier version. Note that some in-memory metadata tracking does accrue as nodes split and sled tracks where to find them in the future when the actual data is paged in. But this is still a fraction of the data stored.
The important thing is that sled is not stable yet. Don't bet your business on databases less than 5 years old. Sled doesn't turn 5 for a few more months, and I'm ironing out a few issues still that relate to production readiness. For now, treat it as a cache, as a responsible SRE would treat any database younger than 5 years old.
Yes, I wrote in sled's discord on Nov 14 and got this from the author:
hey @panicfarm, thanks for the reproduction code! I think this is the same issue that kerugma ran into and mentioned a couple days ago. I'm curious if it still happens on sled master? I just did a huge refactor and sometimes a bunch of issues just shake out. Anyway, I will turn this into a proper regression test and get the page_out method of PageCache to properly be called, which is the underlying root.
I've used ports of BerkeleyDB several times for things down at that level. It isn't very powerful, but it's enough that you can use it for most microcontroller needs.
Putting SQLite onto a controller that isn't running a full OS is actually a lot of work. [0] There isn't a working RTOS port out there, last I checked.
Why no Tree<K,V>? It would be very nice to say up front that one wants fixed-size keys and values and get the benefits.
This whole hurr-durr arbitrary-data k/v thing never made sense to me. Truly unstructured data is meaningless (that's information theory for you), so why not leverage some types?
You're totally right that data is meaningless without the ability to use it. This is something I've been thinking about a lot, because it involves a lot of trade-offs that definitely were not apparent to me before I started building this.
Ultimately, I do want to make it easy to write a Tree<K, V> on top of sled, but if I were to do it myself it would significantly reduce performance and impose restrictions on users that don't feel appropriate to me.
* Being a lock-free B+ tree, there are a ton of cool techniques we can use on bytes that would not apply to arbitrary K types. For instance, all keys stored in a tree node are prefix encoded by the node's low key. This allows very long keys with common prefixes to be very cheaply stored, which is useful for things like F1-like embedded tables. We also do key truncation, where when we do a node split, we chop off the bytes necessary to actually differentiate between one side and the other, which further reduces the size of the index. Prefix encoding and suffix truncation are the bread and butter of modern B+ tree implementations, and that would hurt a lot to walk away from.
* All K and V types must then be serializable and deserializable. In Rust we represent this with traits that come from specific external libraries. serde took up half of sled's compilation time when it was being used, so I don't want to pull that back in as a mandatory requirement for all users. I could have my own traits for this (and if I do anything like this, it will be the approach I take), and then external crates would implement serde-sled etc., but depending on external crates makes for a very brittle API surface that requires all users to target the exact same version of that external trait. The core must be self-consistent and not export any external types that would cause sharp dependency issues.
* I like the idea of caching values that are deserialized only once, to avoid repeated deserialization costs on hot items. Something like this may be implemented in sled, but I haven't figured out the best way to represent it without causing memory issues etc. In the meantime I'm more tempted to just point people to the solutions above that let them view their bytes as structured data without many deserialization costs.
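As a rough illustration of the prefix encoding idea mentioned above, here is a simplified sketch (not sled's actual implementation; the function names are invented) of storing only the bytes of a key that differ from the node's low key:

```rust
// Length of the byte prefix shared by two keys.
fn shared_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

// Store only the suffix of `key` past the prefix it shares with
// the node's low key, plus the shared length.
fn prefix_encode(low_key: &[u8], key: &[u8]) -> (usize, Vec<u8>) {
    let n = shared_prefix_len(low_key, key);
    (n, key[n..].to_vec())
}

// Rebuild the full key from the node's low key and the stored suffix.
fn prefix_decode(low_key: &[u8], prefix_len: usize, suffix: &[u8]) -> Vec<u8> {
    let mut key = low_key[..prefix_len].to_vec();
    key.extend_from_slice(suffix);
    key
}

fn main() {
    let low = b"user:1000";
    let key = b"user:1000:email";
    let (n, suffix) = prefix_encode(low, key);
    // Only the 6-byte suffix ":email" needs to live in the node.
    assert_eq!(n, 9);
    assert_eq!(suffix, b":email".to_vec());
    assert_eq!(prefix_decode(low, n, &suffix), key.to_vec());
}
```

With many long keys sharing a common prefix (say, all rows of one table), the per-key storage collapses to the distinguishing suffix, which is what makes F1-style embedded tables cheap.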
In the spirit of my other comment https://news.ycombinator.com/item?id=22375979 it would be nice to see the layers broken out into separately usable abstractions. Ideally, this means layering, K and V type params, and no loss of performance.
> Prefix encoding and suffix truncation
At the very least, you can do all this stuff in the case where the key is [u8]/Vec<u8>. But maybe also [T]/Vec<T>? I love to look at my existing monomorphic algorithms and think "what traits would this require to make this as abstract as possible?" Almost never is the answer "sorry, it cannot be generalized at all".
> All K and V types must then be serializable and deserializable... and if I do anything like this, it will be the approach I take...
I agree with using your own traits. Per the above, your algorithms come first, not other people's abstractions. But I don't know why you can't just do that and ignore serde entirely. serde becomes someone else's problem, and if they don't want to bother with it you will provide the [u8]/Vec<u8> instances.
> And in the mean time I'm more tempted to just point people to the 3 solutions above that let them view their bytes as structured data without many deserialization costs.
Agreed. If your data structure has a pointer, that should be a foreign key in my book (though not necessarily in the same direction!). It's best if you can just memcpy the row, more or less.
[I say the pointer thing not only because of marshaling cost, but also data modeling. Huge rows / JSON blobs just don't make sense to me. There almost surely is some structure to the nested data, and that deserves to be formalized and enforced in its own index. Ignore what they told you in SQL class. Indexes express/reify invariants, and foreign keys act through indices.]
An application could easily create its own Tree<K,V> with serde or bincode or something else.
A key/value database doesn't really want to know about an application's types, because that complicates storage algorithms: sorting becomes much more complex. In the same way, asm doesn't really have data types. All you have are bytes and addresses. A database needs to be able to use the encoding that it's given, rather than create another type system underneath it.
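A minimal sketch of the "application-level Tree<K,V>" idea, with a hand-rolled codec trait instead of serde. `ToBytes` and `TypedTree` are invented names for illustration, and a `BTreeMap` stands in for a byte-oriented store like sled:

```rust
use std::collections::BTreeMap;
use std::marker::PhantomData;

// Hypothetical codec trait: how a typed layer could map keys/values
// to bytes without pulling serde into the storage core.
trait ToBytes {
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(bytes: &[u8]) -> Self;
}

impl ToBytes for u64 {
    // Big-endian, so byte order matches numeric order in the index.
    fn to_bytes(&self) -> Vec<u8> { self.to_be_bytes().to_vec() }
    fn from_bytes(bytes: &[u8]) -> Self {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(bytes);
        u64::from_be_bytes(buf)
    }
}

impl ToBytes for String {
    fn to_bytes(&self) -> Vec<u8> { self.clone().into_bytes() }
    fn from_bytes(bytes: &[u8]) -> Self {
        String::from_utf8(bytes.to_vec()).unwrap()
    }
}

// Typed view over an untyped byte store. The store itself never
// learns about K or V; all ordering happens on the encoded bytes.
struct TypedTree<K: ToBytes, V: ToBytes> {
    inner: BTreeMap<Vec<u8>, Vec<u8>>,
    _marker: PhantomData<(K, V)>,
}

impl<K: ToBytes, V: ToBytes> TypedTree<K, V> {
    fn new() -> Self {
        TypedTree { inner: BTreeMap::new(), _marker: PhantomData }
    }
    fn insert(&mut self, k: K, v: V) {
        self.inner.insert(k.to_bytes(), v.to_bytes());
    }
    fn get(&self, k: &K) -> Option<V> {
        self.inner.get(&k.to_bytes()).map(|b| V::from_bytes(b))
    }
}

fn main() {
    let mut tree: TypedTree<u64, String> = TypedTree::new();
    tree.insert(42, "answer".to_string());
    assert_eq!(tree.get(&42), Some("answer".to_string()));
}
```

The storage engine keeps sorting plain bytes; the typed wrapper is a thin shim the application owns, which is roughly the division of labor the comment above argues for.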
> A key/value database doesn't really want to know about an application's type
Relational DBs have for decades.
> Because that complicates storage algorithms because sorting is now much more complex
Leveraging the types to build better indices is huge. Different data admits different total orders / partial orders / lattices / other mathematical structure. If you generalize the math to infinitely large tables, this is not merely a matter of certain queries being faster or slower; it is the difference between queries existing or not existing. One might say the limit of algorithmic complexity is realizability: https://ncatlab.org/nlab/show/realizability
https://fosdem.org/2020/schedule/event/rust_techniques_sled/
Talk brief:
sled and rio
modern database engineering with io_uring