Hacker News
Sled: Embedded Database Written in Rust (github.com/spacejam)
198 points by adamnemecek on Feb 20, 2020 | 48 comments


There was an awesome talk about Sled at FOSDEM very recently; highly recommend it.

https://fosdem.org/2020/schedule/event/rust_techniques_sled/

Talk abstract:

sled and rio

modern database engineering with io_uring

sled is an embedded database that takes advantage of modern lock-free indexing and flash-friendly storage. rio is a pure-Rust io_uring library unlocking the Linux kernel's new asynchronous IO interface. This short talk will cover techniques that have been used to take advantage of modern hardware and kernels while optimizing for long-term developer happiness in a complex, correctness-critical Rust codebase.
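A minimal sketch of what using rio looks like, assuming its write_at/Completion::wait API (signatures from memory; check the repo for the real thing, and the file path is hypothetical):

    fn main() -> std::io::Result<()> {
        // set up the io_uring-backed ring
        let ring = rio::new()?;

        let file = std::fs::OpenOptions::new()
            .create(true)
            .write(true)
            .open("example_file")?; // hypothetical path

        let data: &[u8] = &[42; 4096];

        // submit an async write at offset 0; wait() blocks until the
        // kernel reports completion
        let completion = ring.write_at(&file, &data, 0);
        completion.wait()?;
        Ok(())
    }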


Good to see some more activity in the embedded data space. Earlier we used Berkeley DB [1], mSQL [2] (from which MySQL grew), and Postgres 4.2 [3], and now we have settled on SQLite for all our embedded database needs. When we need performance, at present we just use SQLite's in-memory mode.

Will give Sled a try when it reaches 1.0.0, as at present SQLite serves our embedded requirements.

[1] https://launchpad.net/berkeley-db

[2] https://hughes.com.au/products/msql/

[3] https://dsf.berkeley.edu/postgres.html


Sled is not really a competitor to those.

Sled is an alternative to other on-disk key/value databases like RocksDB and LMDB.


Make sure to check out LMDB if you haven't. It's not relational and has an extremely anemic API, but it makes a great low-level building block. Sled is similar in that regard, very different from the others you have named.


I've seen some startling performance claims about LMDB, but don't know anyone using it.

Why isn't LMDB used as an engine for e.g. MySQL? Why, for example, did Facebook go with LevelDB in MyRocks instead of LMDB?


My company uses LMDB extensively. LevelDB is an LSM database; LSM trees are generally better for write-heavy workloads than B-tree databases like LMDB and sled. LMDB also has a single-writer restriction.


BIND9 can use LMDB for storing the configuration of dynamically added zones. https://www.isc.org/blogs/bind-release-911/


Apparently if you're on Windows you don't deserve performance:

    #[cfg(windows)]
    const MAX_THREADS: usize = 16;


    #[cfg(not(windows))]
    const MAX_THREADS: usize = 128;
From: https://github.com/spacejam/sled/blob/bf37bd120fbf62f78408e0...


also great commit message: https://github.com/spacejam/sled/commit/b7a5d14399540daa433a...

the limit of 16 threads under Windows was added in a commit named "refactor imports".

If somebody in the future tries to find out why the threads are limited to 16 under windows, they will eventually end up at that commit and if the committer isn't around any more (or doesn't remember), nobody will ever know the reasoning.

For projects that have even the slightest chance of staying around, be mindful about your commit messages and explain why you are doing something, not what you are doing - that's what the code itself is for.


I would also argue for a short comment in the code for things like this. Code gets shared and the version control doesn’t always come with it. I’m mostly bitten by this when using some vendor’s SDK where they release a tarball of the code but keep the repo private. But even within your own organization, code will get copy/pasted to another repo and the comment is more likely to go with it.


> be mindful about your commit messages and explain why you are doing something, not what you are doing

It's sad to see how many experienced programmers do not understand this. Of course, it takes a little more effort, but without this type of thinking you might as well skip the commit message completely. I am a very junior programmer, and it was one of the first things I was corrected for not doing, and rightly so.


Code for your future self in 6 months time...


spacejam (Author) comments:

for the hn h8rz: you don't deserve performance bwahahhahaha maybe this should actually be dropped to 0 olololololololol

https://github.com/spacejam/sled/commit/b7a5d14399540daa433a...


I've seen this kind of thing many times before: the developer wants to commit something (in this case, an import refactor), and accidentally commits other changes which were being worked on. This is why I don't recommend getting into the habit of using "git commit -a".


absolutely. I'm telling people to always use `git add -p` to help them produce self-contained commits.

And if you screw something up, there's always `git commit --amend` or the sledgehammer of `git rebase -i`.

Note: I'm not advocating doing this to public history, but I'm saying: Use these tools in order to create self-contained changes on your end before committing them to public history.

When I'm doing archeology trying to find out why Foo is doing Bar instead of Baz, I don't want to know that back then, 10 years ago, you fixed a typo or added a "forgotten file".


Well, Threadripper tests showed Windows does not scale that high.


It does, but the client code needs NUMA support, which Rust does not currently have.

It doesn't help that the default behaviour in Windows is just stupid:

"By default, an application is constrained to a single group, which should provide ample processing capability for the typical application."

This is the new "640KB should be enough for everybody", except now it is "64 threads should be enough".

See:

https://docs.microsoft.com/en-us/windows/win32/procthread/pr...

https://www.anandtech.com/show/15483/amd-threadripper-3990x-...
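To see the group partitioning in practice, here's a rough Rust sketch using hand-declared Win32 FFI (GetActiveProcessorGroupCount and GetActiveProcessorCount are real kernel32 calls, but treat the details here as an unverified sketch):

    #[cfg(windows)]
    #[link(name = "kernel32")]
    extern "system" {
        fn GetActiveProcessorGroupCount() -> u16;
        fn GetActiveProcessorCount(group_number: u16) -> u32;
    }

    #[cfg(windows)]
    fn main() {
        // each group holds at most 64 logical processors; by default a
        // process is scheduled within a single group
        let groups = unsafe { GetActiveProcessorGroupCount() };
        for g in 0..groups {
            let cpus = unsafe { GetActiveProcessorCount(g) };
            println!("group {}: {} logical CPUs", g, cpus);
        }
    }

    #[cfg(not(windows))]
    fn main() {}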


Which surprises me, because formal object ownership and read/write access sound like exactly the sorts of things you’d like to have for a NUMA machine.


> 64 threads should be enough.

It’s more that if you have a purely sequential workload, you get access to less than 2% of the CPU on a 64-core machine.

And even if you’re a bit parallel, Amdahl’s law smacks you pretty hard here.
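Plugging numbers into Amdahl's law (assuming, say, 90% of the work parallelizes):

    speedup(n)  = 1 / ((1 - p) + p/n)
    speedup(64) = 1 / (0.10 + 0.90/64) ≈ 8.8x, nowhere near 64x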


The L3 cache is shared and has a mindblowing size, let's not forget that.

Plus you get way more thermal room.

Plus you can disable SMT.

And there is always more than 1 process running if you take into account kernel threads.

So I would say quite a lot more than 2%. More like 10%.


> a lot more than 2%. More like 10%.

That's still essentially single-digit percentages. And if it is really in the low double digits, you're back to single digits when you go to 80 cores.

On the positive side, I don't think multi-user systems would be where they are today if we hadn't been staring at a future full of hardware that can only be fully utilized by embarrassingly parallel tasks.


Sounds like a really sensible default to me.


only narcs use windows for database workloads :P


If performance were important to you, why would you even choose Windows as your OS?


Ignoring the obvious fact that not everybody gets to choose the operating system their enterprise uses, there are real reasons to use Windows for performance.

For one, SQL Server spanks most other databases for complex ad-hoc queries.

Don't get me wrong, PostgreSQL is getting a lot better these days, and I'm sure that for a lot of basic workloads it's very competitive, but the query planner in SQL Server is basically black magic.

There are very few (any?) database engines out there that can, in a single query, combine in-memory tables, remote tables, functions, materialised views, columnar data, and recursive joins, then run that query in parallel and efficiently.

I think Oracle has most of those checkbox features, and perhaps a couple of others do too (DB2 perhaps?), but as far as I know MS SQL Server still outperforms them for this type of thing.

Mind you, Oracle scales to bigger clusters, and DB2 runs on bigger iron, but that's not my use-case. I have one Intel box.


> Ignoring the obvious that not everybody gets to choose the operating system their enterprise uses

For sure. If you can't choose, then I don't blame you.

> there's real reasons to use Windows for performance

I can't really think of any reason though. You gave the example of SQL Server, but that also runs on Linux. Does SQL Server perform better on Windows than on Linux?

Edit: found a benchmark here: https://www.dbbest.com/blog/running-sql-server-on-linux/

Turns out you get better performance from SQL Server on Linux than on Windows. I am not surprised by this at all.


I'd love to see someone go for an "exo-db" architecture, where the layers are broken out into distinct crates. The expressiveness that makes this possible should be a main benefit of Rust, on par with safety.

cf. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.435...



> If you want to store numerical keys in a way that will play nicely with sled's iterators and ordered operations, please remember to store your numerical items in big-endian form. Little endian (the default of many things) will often appear to be doing the right thing until you start working with more than 256 items (more than 1 byte), causing lexicographic ordering of the serialized bytes to diverge from the lexicographic ordering of their deserialized numerical form.

While I can understand that big-endian ordering is convenient for size-indifferent lexicographic ordering, is the tradeoff worth it? Yes, your benchmarks will look better, but real-world usage will require clients to continually byte-swap. So you've basically forced a performance drop into user code. Also, wouldn't size-aware compares perform better in the generated machine code anyway?
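Concretely, the byte-swapping looks something like this (a sketch, assuming sled's insert/range API; the helper name is hypothetical):

    use std::convert::TryInto;

    // keys stored big-endian so that byte order matches numeric order
    fn demo(db: &sled::Db) -> Result<(), Box<dyn std::error::Error>> {
        for i in 0u64..1_000 {
            db.insert(i.to_be_bytes(), i.to_string().into_bytes())?;
        }
        // the range scan comes back in numeric order
        for kv in db.range(10u64.to_be_bytes()..20u64.to_be_bytes()) {
            let (k, _v) = kv?;
            let n = u64::from_be_bytes(k.as_ref().try_into()?);
            println!("{}", n);
        }
        Ok(())
    }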


Swapping 8 bytes around is a minuscule cost compared to the cost of a badly formed index. There is really no way around storing the data big-endian.


yeah, le-be conversions are generally not measurable above the noise of other database-related work.

sled stores arbitrary bytes. endianness is the concern of the person who wants to store higher-level types than bytes, like integers. I do imagine having a story for letting people deserialize/view bytes once, and having that view sit in cache so that hot items are not repeatedly deserialized.

I agree with Adya's work "Fast key-value stores: An idea whose time has come and gone" https://research.google/pubs/pub48030/, which makes the case that stateful systems should not have to pay repeated deserialization and network costs. I want sled to be well-situated for the world that we're headed into, where this view will become more prominent. This means caching deserialized data, better replication stories, and maybe nice helpers for distributed database authors that allow for atomic tree splits / merges and per-tree replication settings.


Not sure if this is still relevant, but sled had a memory leak that precluded us from using it for an actual production project 3 months ago:

https://github.com/panicfarm/sltest


On sled 0.31 the memory growth does not run away as it did with the earlier version. Note that some in-memory metadata tracking does accrue as nodes split and sled tracks where to find them later when the actual data is paged in, but this is still a fraction of the data stored.

The important thing is that sled is not stable yet. Don't bet your business on databases less than 5 years old. Sled doesn't turn 5 for a few more months, and I'm ironing out a few issues still that relate to production readiness. For now, treat it as a cache, as a responsible SRE would treat any database younger than 5 years old.


The reproducible bug is great, and I went looking to see if it might have been fixed, but I didn't find the same bug in the issue list.

Did you let sled know about the bug? Can you point me in that direction?


Yes, I wrote in sled's discord on Nov 14 and got this from the author:

hey @panicfarm, thanks for the reproduction code! I think this is the same issue that kerugma ran into and mentioned a couple days ago. I'm curious if it still happens on sled master? I just did a huge refactor and sometimes a bunch of issues just shake out. Anyway, I will turn this into a proper regression test and get the page_out method of PageCache to properly be called, which is the underlying root.


What kinds of systems is this used in? I'm assuming it's not for embedded systems at the microcontroller level?


No, this would be used in places where you would use SQLite or RocksDB.


Are there databases for embedded applications, such as ARM microcontrollers?


I've used ports of Berkeley DB several times for things down at that level. It isn't very powerful, but it's enough that you can use it for most microcontroller needs.



Probably SQLite if not just flat files


Putting SQLite onto a controller that isn't running a full OS is actually a lot of work [0]. There isn't a working RTOS port out there, last I checked.

[0] https://sqlite.org/custombuild.html


Well yeah, it depends on the board's limitations.


Why no Tree<K,V>? It would be very nice to say up front that one wants fixed-size keys and values and get the benefits.

This whole hurr-durr arbitrary-data k/v thing never made sense to me. Truly unstructured data is meaningless -- that's information theory for you -- so why not leverage some types?


The best solution is to use a zero-copy format like FlatBuffers, Google's zerocopy library, or serde's borrow attributes: https://serde.rs/lifetimes.html#borrowing-data-in-a-derived-...
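For example, a minimal borrowing sketch with serde (the struct is hypothetical):

    use serde::Deserialize;

    #[derive(Deserialize)]
    struct Record<'a> {
        #[serde(borrow)]
        name: &'a str, // points into the input buffer, no copy
        count: u64,
    }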

You're totally right that data is meaningless without the ability to use it. This is something I've been thinking about a lot, because it involves a lot of trade-offs that definitely were not apparent to me before I started building this.

Ultimately, I do want to make it easy to write a Tree<K, V> on top of sled, but if I were to do it myself it would significantly reduce performance and impose restrictions on users that don't feel appropriate to me.

* Being a lock-free B+ tree, there are a ton of cool techniques we can use on bytes that would not apply to arbitrary K types. For instance, all keys stored in a tree node are prefix encoded by the node's low key. This allows very long keys with common prefixes to be very cheaply stored, which is useful for things like F1-like embedded tables. We also do key truncation, where when we do a node split, we chop off the bytes necessary to actually differentiate between one side and the other, which further reduces the size of the index. Prefix encoding and suffix truncation are the bread and butter of modern B+ tree implementations, and that would hurt a lot to walk away from.

* All K and V types must then be serializable and deserializable. In rust we represent this with traits that come from specific external libraries. serde took up half of the sled compilation time when it was being used, so I don't want to pull that back in as a mandatory requirement for all users. I could have my own traits for this, and if I do anything like this, it will be the approach I take, and then external crates will implement serde-sled etc... but by depending on external crates, it makes for a very brittle API surface that requires all users to target the exact same version of that external trait. The core must be self-consistent and not export any external types that would cause sharp dependency issues.

* I like the idea of caching values that are deserialized only once, to avoid repeated deserialization costs on hot items. Something like this may be implemented in sled, but I haven't figured out the best way to represent it without causing memory issues etc... And in the meantime I'm more tempted to just point people to the 3 solutions above that let them view their bytes as structured data without many deserialization costs.


In the spirit of my other comment (https://news.ycombinator.com/item?id=22375979), it would be nice to see the layers broken out into separately usable abstractions. Ideally, this means layering, K and V type params, and no loss of performance.

> Prefix encoding and suffix truncation

At the very least, you can do all this stuff in the case where the key is [u8]/Vec<u8>. But maybe also [T]/Vec<T>? I love to look at my existing monomorphic algorithms and think "what traits would this require to make it as abstract as possible?" Almost never is the answer "sorry, it cannot be generalized at all".

> All K and V types must then be serializable and deserializable....and if I do anything like this, it will be the approach I take.....

I agree with your own traits. Per the above, your algorithms come first, not other people's abstractions. But I don't know why you can't just do that and ignore serde entirely. Serde becomes someone else's problem, and if they don't want to bother with it, you provide the [u8]/Vec<u8> instances.

> And in the mean time I'm more tempted to just point people to the 3 solutions above that let them view their bytes as structured data without many deserialization costs.

Agreed. If your data structure has a pointer, that should be a foreign key in my book (though not necessarily in the same direction!). It's best if you can just memcpy the row, more or less.

[I say the pointer thing not only because of marshaling cost, but also data modeling. Huge rows / JSON blobs just don't make sense to me. There almost surely is some structure to the nested data, and that deserves to be formalized and enforced in its own index. Ignore what they told you in SQL class. Indexes express/reify invariants, and foreign keys act through indices.]


An application could easily create its own Tree<K,V> with serde or bincode or something else.

A key/value database doesn't really want to know about an application's types, because sorting then becomes much more complex and that complicates the storage algorithms. In the same way, asm doesn't really have data types: all you have are bytes and addresses. A database needs to be able to use the encoding that it's given, rather than create another type system underneath it.
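A rough sketch of such a wrapper, assuming bincode 1.x on top of sled::Tree (caveat: bincode's default little-endian integer encoding won't preserve numeric key ordering; an order-preserving key encoding is its own problem):

    use serde::{de::DeserializeOwned, Serialize};
    use std::marker::PhantomData;

    struct TypedTree<K, V> {
        inner: sled::Tree,
        _kv: PhantomData<(K, V)>,
    }

    impl<K: Serialize, V: Serialize + DeserializeOwned> TypedTree<K, V> {
        fn insert(&self, k: &K, v: &V) -> Result<(), Box<dyn std::error::Error>> {
            // caution: this key encoding is not order-preserving
            self.inner
                .insert(bincode::serialize(k)?, bincode::serialize(v)?)?;
            Ok(())
        }

        fn get(&self, k: &K) -> Result<Option<V>, Box<dyn std::error::Error>> {
            Ok(match self.inner.get(bincode::serialize(k)?)? {
                Some(bytes) => Some(bincode::deserialize(&bytes)?),
                None => None,
            })
        }
    }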


> A key/value database doesn't really want to know about an application's type

Relational DBs have for decades.

> Because that complicates storage algorithms because sorting is now much more complex

Leveraging the types to build better indices is huge. Different data admits different total orders / partial orders / lattices / other mathematical structure. If you generalize the math to infinitely large tables, this is not just a matter of performance; it is the difference between certain queries being realizable or not. One might say the limit of algorithmic complexity is realizability: https://ncatlab.org/nlab/show/realizability



