You're totally right that data is meaningless without the ability to use it. This is something I've been thinking about a lot, because it involves a lot of trade-offs that definitely were not apparent to me before I started building this.
Ultimately, I do want to make it easy to write a Tree<K, V> on top of sled, but if I were to do it myself it would significantly reduce performance and impose restrictions on users that don't feel appropriate to me.
* Being a lock-free B+ tree, there are a ton of cool techniques we can use on bytes that would not apply to arbitrary K types. For instance, all keys stored in a tree node are prefix encoded by the node's low key. This allows very long keys with common prefixes to be very cheaply stored, which is useful for things like F1-like embedded tables. We also do key truncation, where when we do a node split, we chop off the bytes necessary to actually differentiate between one side and the other, which further reduces the size of the index. Prefix encoding and suffix truncation are the bread and butter of modern B+ tree implementations, and that would hurt a lot to walk away from.
* All K and V types must then be serializable and deserializable. In rust we represent this with traits that come from specific external libraries. serde took up half of the sled compilation time when it was being used, so I don't want to pull that back in as a mandatory requirement for all users. I could have my own traits for this, and if I do anything like this, it will be the approach I take, and then external crates will implement serde-sled etc... but by depending on external crates, it makes for a very brittle API surface that requires all users to target the exact same version of that external trait. The core must be self-consistent and not export any external types that would cause sharp dependency issues.
* I like the idea of caching values that are deserialized only once, to avoid repeated deserialization costs on hot items. Something like this may be implemented in sled, but I haven't figured out the best way to represent it without causing memory issues etc... And in the mean time I'm more tempted to just point people to the 3 solutions above that let them view their bytes as structured data without many deserialization costs.
In the spirit of my other comment https://news.ycombinator.com/item?id=22375979 it would be nice to say the layers broken out into separately usable abstractions. Ideally, this means layering, K and V type params, and no loss of performance.
> Prefix encoding and suffix truncation
At the very least, you can do all this stuff in the key is like [u8]/Vec<u8> case. But maybe also [T]/Vec<T>? I love to look at my existing monomorphic algorithms, and think "what traits would this require to make this as abstract as possible"? Almost never is the answer "sorry, it cannot be generalized at all".
> All K and V types must then be serializable and deserializable....and if I do anything like this, it will be the approach I take.....
I agree with your own traits. Per the above your algorithms come first, not other people's abstractions. But I don't know why you can't just do that and ignore serde entirely. Serde becomes someone else's problem, and if they don't want to bother with it you will provide the [u8]/Vec<u8> instances.
> And in the mean time I'm more tempted to just point people to the 3 solutions above that let them view their bytes as structured data without many deserialization costs.
Agreed. If you data structure has a pointer, that should be a foreign key in my book (though not necessarily in the same direction!). It's best if you can just memcopy the row, more or less.
[I say the pointer thing not only because marshaling cost, but also data modeling. Huge rows / json blogs just don't make sense to me. There almost surely is some structure to the nested data, and that deserves to be formalized and enforced in its own index. Ignore what they told you in SQL class. Indexes express/reify invariants, and foreign keys act through indices.]
You're totally right that data is meaningless without the ability to use it. This is something I've been thinking about a lot, because it involves a lot of trade-offs that definitely were not apparent to me before I started building this.
Ultimately, I do want to make it easy to write a Tree<K, V> on top of sled, but if I were to do it myself it would significantly reduce performance and impose restrictions on users that don't feel appropriate to me.
* Being a lock-free B+ tree, there are a ton of cool techniques we can use on bytes that would not apply to arbitrary K types. For instance, all keys stored in a tree node are prefix encoded by the node's low key. This allows very long keys with common prefixes to be very cheaply stored, which is useful for things like F1-like embedded tables. We also do key truncation, where when we do a node split, we chop off the bytes necessary to actually differentiate between one side and the other, which further reduces the size of the index. Prefix encoding and suffix truncation are the bread and butter of modern B+ tree implementations, and that would hurt a lot to walk away from.
* All K and V types must then be serializable and deserializable. In rust we represent this with traits that come from specific external libraries. serde took up half of the sled compilation time when it was being used, so I don't want to pull that back in as a mandatory requirement for all users. I could have my own traits for this, and if I do anything like this, it will be the approach I take, and then external crates will implement serde-sled etc... but by depending on external crates, it makes for a very brittle API surface that requires all users to target the exact same version of that external trait. The core must be self-consistent and not export any external types that would cause sharp dependency issues.
* I like the idea of caching values that are deserialized only once, to avoid repeated deserialization costs on hot items. Something like this may be implemented in sled, but I haven't figured out the best way to represent it without causing memory issues etc... And in the mean time I'm more tempted to just point people to the 3 solutions above that let them view their bytes as structured data without many deserialization costs.