This was my interpretation as well. I'm going to compare a disk-backed bwtree with a disk-backed ART, both backed by the same pagecache, and maybe end up with an ART that scatters partial pages on disk, bwtree style. But I need to measure apples to apples on the metrics that matter for storage first. The pagecache is where most of the complexity is in my implementation, and it makes building different kinds of persistent structures on top of it pretty easy. docs.rs/pagecache
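To make "different persistent structures on top of the same pagecache" concrete, here is a rough sketch of the kind of interface I mean. The trait and type names are hypothetical stand-ins, not the actual docs.rs/pagecache API:

```rust
// Hypothetical page-store interface, not the real pagecache API.
// The point: a bwtree and an ART differ mainly in how they serialize
// their nodes into page fragments; caching, logging, and recovery
// live below this line.
type PageId = u64;

trait PageStore {
    /// Allocate a fresh logical page.
    fn allocate(&mut self) -> PageId;
    /// Append a delta fragment to a logical page (partial pages, bwtree style).
    fn link(&mut self, pid: PageId, frag: Vec<u8>);
    /// Read back all fragments for a page, newest first.
    fn get(&self, pid: PageId) -> Vec<Vec<u8>>;
    /// Replace all fragments with a single consolidated page.
    fn replace(&mut self, pid: PageId, page: Vec<u8>);
}

/// Anything that can flatten itself into fragments can be made persistent
/// on top of such a store: bwtree nodes today, ART nodes later.
trait PageNode: Sized {
    fn serialize(&self) -> Vec<u8>;
    fn deserialize(frags: &[Vec<u8>]) -> Self;
}
```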
Yeah, I'm curious about using sled as a more SSD-friendly storage engine for mentat. I'm just starting to experiment with datalog implementations, but I think that by having harmony between the storage engine, the query language, and the hardware's properties we can build really compelling stateful systems. If this is something that interests you, I'd love to work with more people on it.
It might not. But the critiques of bwtree performance that I've seen haven't had compelling data on the things that matter outside of academia or benchmarking shootouts, like write or space amplification. The bwtree is a cheap thing to abandon after I implement a persistent ART and measure it, though: it's only about 1k lines of Rust on top of the modular pagecache, which is the real heart of the system.
This is honestly a use case I'm experimenting with, using a mix of CRDTs and OT. Our systems are becoming more and more location-agnostic, and I don't feel that our current data infrastructure is adequate for the workloads we're going to face as compute migrates to the edge.
I am a total devotee of their approach to building simulable systems, although I seek to push it even further and integrate lineage-driven fault injection from an early stage. I see a lot of cool things in what they have done, technically. Sled is free from day one.
It's not needed for the single-key atomic record store, which is the sled bwtree index, currently the highest-level module. MVCC is implemented in most popular embedded DBs because it's an effective way to handle mixed workloads: it lets transactions read a snapshot of the entire database as of a single point in time without blocking the writes that happen underneath. That's why I'm building it as a higher-level module. Sled is a collection of modules that let you choose the abstraction, and the associated complexity, that you want. There's also a modular pagecache that is totally decoupled and reusable for your own database experiments.
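As a toy illustration of why MVCC buys you that (generic, not sled's design): once values are keyed by (key, version), a snapshot read is just "newest version ≤ my snapshot", and writers never disturb it.

```rust
use std::collections::BTreeMap;
use std::ops::Bound::Included;

// Toy MVCC store, purely illustrative. A reader holding snapshot `snap`
// only ever sees the newest write with version <= snap, so long-running
// snapshot reads and new writes don't block each other.
struct Mvcc {
    versions: BTreeMap<(Vec<u8>, u64), Vec<u8>>,
    current: u64,
}

impl Mvcc {
    fn new() -> Mvcc {
        Mvcc { versions: BTreeMap::new(), current: 0 }
    }

    /// Install a new version of `key` and return its version number.
    fn set(&mut self, key: &[u8], val: &[u8]) -> u64 {
        self.current += 1;
        self.versions.insert((key.to_vec(), self.current), val.to_vec());
        self.current
    }

    /// Read `key` as it existed at snapshot `snap`, ignoring newer writes.
    fn get(&self, key: &[u8], snap: u64) -> Option<&Vec<u8>> {
        self.versions
            .range((
                Included((key.to_vec(), 0)),
                Included((key.to_vec(), snap)),
            ))
            .next_back()
            .map(|(_, v)| v)
    }

    /// The version a new transaction would adopt as its snapshot.
    fn snapshot(&self) -> u64 {
        self.current
    }
}
```

Any write that lands after a reader grabs snapshot() gets a higher version and stays invisible to that reader, which is the property that makes mixed workloads pleasant.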
It's modular, and there is a paxos implementation, but it has been built entirely in simulation so far and I haven't plugged it into an IO layer yet; that part is trivial. That said, sled will always be a bwtree index, and the other modular crates will stand on their own.
Currently it's even more basic. The usable parts today are a pagecache following the LLAMA approach, some great testing utility libraries, and an index (which you can use as a KV store) that follows the bwtree approach. Later it will gain structured access support, but it needs a few more DB components to get there. It is a construction kit as well as a KV store.
ALICE showed that's not always true with SQLite. Sled is being built with an extreme bias toward reliability over features, but as the readme says, it has some way to go before reaching maturity. The tests are quite good at finding new issues and deterministically replaying them, so you can help it get there by mining bugs with the default test suite.
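Roughly, those tests are shaped like seed-driven model checking. A sketch of the idea (illustrative only, not the actual harness; the model and the stand-in store here are just std maps):

```rust
use std::collections::{BTreeMap, HashMap};

// Generate a deterministic stream of operations from a seed, apply it to
// both the system under test and a trivial in-memory model, and replay any
// divergence exactly by re-running the same seed.
#[derive(Debug)]
enum Op {
    Set(u8, u8),
    Del(u8),
    Get(u8),
}

fn ops_for_seed(seed: u64, n: usize) -> Vec<Op> {
    let mut x = seed | 1; // xorshift64 must not start at zero
    let mut next = move || {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        x
    };
    (0..n)
        .map(|_| match next() % 3 {
            0 => Op::Set(next() as u8, next() as u8),
            1 => Op::Del(next() as u8),
            _ => Op::Get(next() as u8),
        })
        .collect()
}

fn check_seed(seed: u64) {
    let mut model: HashMap<u8, u8> = HashMap::new();
    let mut tree: BTreeMap<u8, u8> = BTreeMap::new(); // stand-in for the store under test
    for op in ops_for_seed(seed, 1_000) {
        match op {
            Op::Set(k, v) => {
                model.insert(k, v);
                tree.insert(k, v);
            }
            Op::Del(k) => {
                model.remove(&k);
                tree.remove(&k);
            }
            Op::Get(k) => assert_eq!(
                tree.get(&k),
                model.get(&k),
                "divergence found; replay deterministically with seed {}",
                seed
            ),
        }
    }
}
```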
Most database designers assume that a power failure will only affect writes that are still pending. Alas, for SSDs and NVMe drives that's not always true. A power failure can cause all kinds of corruption.
Long story short: even append-only strategies will not save you.
Indeed. This is why I aggressively checksum everything and pay particular attention to throwing away all data written after any corruption detected during recovery. This is easier with the log-only architecture. It's also totally OS- and filesystem-agnostic. I was happily surprised yesterday when it passed tests on Fuchsia :]
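A minimal sketch of that recovery rule, with illustrative framing rather than sled's actual on-disk format: each entry carries a length and a CRC, recovery walks the log sequentially, and everything from the first torn or corrupt entry onward gets thrown away.

```rust
use std::io::{self, Read};

// Illustrative entry framing, not sled's real format:
// [len: u32 LE][crc32 of payload: u32 LE][payload]
fn crc32(data: &[u8]) -> u32 {
    // bitwise CRC-32 (IEEE); slow but dependency-free, fine for a sketch
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 == 1 {
                (crc >> 1) ^ 0xEDB8_8320
            } else {
                crc >> 1
            };
        }
    }
    !crc
}

/// Walk the log from the front, keep every entry that checks out, and
/// report the byte offset at which the log should be truncated so that
/// nothing written after the first detected corruption survives recovery.
fn recover<R: Read>(mut log: R) -> io::Result<(Vec<Vec<u8>>, u64)> {
    let mut entries = Vec::new();
    let mut valid_until = 0u64;
    loop {
        let mut header = [0u8; 8];
        if log.read_exact(&mut header).is_err() {
            break; // clean EOF or a torn header: stop here
        }
        let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
        let expected = u32::from_le_bytes([header[4], header[5], header[6], header[7]]);
        let mut payload = vec![0u8; len];
        if log.read_exact(&mut payload).is_err() || crc32(&payload) != expected {
            break; // torn or corrupt entry: discard it and everything after it
        }
        valid_until += 8 + len as u64;
        entries.push(payload);
    }
    Ok((entries, valid_until))
}
```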
Users can rely on sequential recovery. At some point I'll probably write a partial-recovery tool that surfaces every version of every key present anywhere in the readable file, which won't be much work. Typical best practice encourages moving away from single-disk reliance for particularly valuable data, but this library will also work on phones etc., so it's important to support people when a wide variety of things go wrong.
You could work around it (erasure coding, for example) if you had insight into the failure-mode specifics, but those are vendor-specific and vendors are not exactly forthcoming.
So the only thing you can do is add checksumming schemes that let you detect that you have been affected.
The paper is from 2013, so the situation might have improved in the meantime (I wouldn't put any money on it).