It's amazing what you can do with S3. It's one of the best things that AWS has to offer.
I wonder, is there a formal definition for a set of primitives that allow you to build an ACID database? Assume an API of some kind (in this case, S3) that you can interact with - and provides, I don't know, locks, % durability, etc.
What would make you say, 'Having those primitives, I CAN build an ACID database on top of it'?
A consistent log plus an atomic compare-and-swap operation is sufficient for a distributed database, but its performance will be extremely questionable. CAS is always the slow step, and in this case it's pathologically slow. The magic is to do whatever you can to avoid it until absolutely necessary. The availability of consistent, ordered, synchronized timestamps across all nodes is something most distributed databases require as a prerequisite. How you handle violations of that (and to what degree of accuracy you can rely on it) makes a considerable difference.
Depending on how you structure the underlying pages, you’ll get to decide how availability at the log level translates to availability in your user/app-facing interface and whether you will end up sacrificing consistency, availability, or partition tolerance.
Basically, S3 with its recent consistency guarantees and all-new CAS support is sufficient in and of itself. But for anything other than the most basic use case (the least data, the lowest write frequency, etc.) you'll need a considerable amount of magic to make it usable.
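To make that concrete, here is a minimal sketch of the CAS using the conditional PutObject (If-None-Match) that recent versions of the AWS SDK for Go v2 expose. The one-object-per-sequence-number layout, the bucket/key naming, and appendRecord itself are illustrative assumptions, not any particular project's design:

    // Sketch: claim sequence number seq by creating log/<seq> only if it does not
    // already exist. If another writer got there first, S3 rejects the PUT with a
    // 412 PreconditionFailed and the caller retries with the next sequence number.
    // The bucket, key layout, and appendRecord itself are illustrative only.
    package s3log

    import (
        "bytes"
        "context"
        "errors"
        "fmt"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/service/s3"
        "github.com/aws/smithy-go"
    )

    func appendRecord(ctx context.Context, client *s3.Client, bucket string, seq uint64, payload []byte) error {
        key := fmt.Sprintf("log/%020d", seq)
        _, err := client.PutObject(ctx, &s3.PutObjectInput{
            Bucket:      aws.String(bucket),
            Key:         aws.String(key),
            Body:        bytes.NewReader(payload),
            IfNoneMatch: aws.String("*"), // the CAS: succeed only if the key does not exist yet
        })
        var apiErr smithy.APIError
        if errors.As(err, &apiErr) && apiErr.ErrorCode() == "PreconditionFailed" {
            return fmt.Errorf("sequence %d already taken, re-read the tail and retry: %w", seq, err)
        }
        return err
    }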
The most straightforward approach would be to reuse the whole of an existing database but swap out the backend and then tweak the frontend accordingly. SQLite lets you use custom VFS providers (already used to serve fairly efficient SQLite over HTTP without shipping the entire database, though previously not for writes), and with Postgres you can use foreign data wrappers. But in both cases you'll basically have to take out a lock to support writes, either on a page or a row (so you either risk lots of contention or introduce a ton of locking and network latency overhead).
That’s really cool! I’m personally really interested in serverless DB offerings. I’m not sure if yours scales well, but I always seem to hit the limits of a single RDBMS instance at some point as a product matures.
There are plenty of ways to scale out a traditional RDBMS, but serverless offerings make it so much easier.
Wake me up when S3 supports write at offset. Until then it's all gimmicky. Writing small objects and retrieving them later is very inefficient and costly for large data volumes. One can do roll-ups, sure, but with roll-ups there's no longer a way to search through the single rolled-up file. One needs some compute to download the complete file and process it outside of S3.
S3 can at least do a multipart upload where any given part is a copy of a range of an existing object. Then you can finish the upload, overwriting the previous object.
GCS, unfortunately, does not support copying a range. OTOH, it has long supported object append through composition.
The challenge with both offerings is that writes to a single object, and writes clustered around a prefix, are seriously rate limited, and consistency properties mostly apply to single objects.
Yeah, but you cannot multipart a single chunk into a larger complete file; you need all the chunks one way or another, since a multipart upload has to be started and completed with all of them. GCS and Azure support this too. S3 allows a maximum of 10k parts, GCS compose 32 objects, and Azure blob storage, afair, 50k blocks. Both can do an operation similar to what you described for S3, with various alternatives of read at offset + length and rolling those up.
In all cases, you end up rolling up into a new key that isn't available for reads until the roll-up is done. It's kinda useless for heavy-write scenarios.
Compare that to a normal fs operation: a write at an offset into an existing file whose size is smaller than the offset will just extend the file to that offset and continue writing.
You create a multi-part with 3 chunks: the 1st part is a copy range of the prefix, the 2nd part the bit you want to change, and the 3rd a copy range of the suffix?
And yes, all of this is useless for heavy (and esp. concurrent) writes.
We both said the same thing: you kinda can but cannot. Yes, you can replace some part of an existing object, but you cannot resize it, nor can you do anything in parallel with it. So you kinda can but cannot. And this trick will work in GCS and Azure too, except there you have to move the new object to the old key yourself after the roll-up. But why not, while you're already at it.
You can do it “in place” as the target can be the same as the source. And you can definitely resize it, both truncate it and extend it. The only restriction, really, is that all parts except for the last one need to be at least 5MiB.
GCS compose can also have target be one of the source objects, so you can append (and/or prepend) “in place.”
For GCS compose the suffix/prefix need to be separate visible objects (though you can put a lifecycle on them). For multipart, the parts are not really objects ever.
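For concreteness, a rough sketch of that three-part splice with the AWS SDK for Go v2: range-copy the prefix and suffix server-side with UploadPartCopy, upload only the replacement bytes, and complete the upload back onto the same key. The bucket, key, offsets, and the spliceObject helper are placeholders, and every part except the last still has to be at least 5 MiB:

    // Sketch: replace bytes [start, end) of bucket/key "in place" via a multipart
    // upload whose first and last parts are server-side range copies of the object
    // itself. size is the current object size in bytes; the prefix and middle parts
    // must each be at least 5 MiB, while the final (suffix) part can be any size.
    package s3splice

    import (
        "bytes"
        "context"
        "fmt"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/service/s3"
        "github.com/aws/aws-sdk-go-v2/service/s3/types"
    )

    func spliceObject(ctx context.Context, c *s3.Client, bucket, key string, size, start, end int64, middle []byte) error {
        mpu, err := c.CreateMultipartUpload(ctx, &s3.CreateMultipartUploadInput{
            Bucket: aws.String(bucket),
            Key:    aws.String(key),
        })
        if err != nil {
            return err
        }
        src := aws.String(bucket + "/" + key) // copy source is the object being rewritten
        var parts []types.CompletedPart

        // Part 1: server-side copy of the untouched prefix [0, start).
        p1, err := c.UploadPartCopy(ctx, &s3.UploadPartCopyInput{
            Bucket: aws.String(bucket), Key: aws.String(key), UploadId: mpu.UploadId,
            PartNumber:      aws.Int32(1),
            CopySource:      src,
            CopySourceRange: aws.String(fmt.Sprintf("bytes=0-%d", start-1)),
        })
        if err != nil {
            return err
        }
        parts = append(parts, types.CompletedPart{ETag: p1.CopyPartResult.ETag, PartNumber: aws.Int32(1)})

        // Part 2: the replacement bytes, uploaded normally.
        p2, err := c.UploadPart(ctx, &s3.UploadPartInput{
            Bucket: aws.String(bucket), Key: aws.String(key), UploadId: mpu.UploadId,
            PartNumber: aws.Int32(2),
            Body:       bytes.NewReader(middle),
        })
        if err != nil {
            return err
        }
        parts = append(parts, types.CompletedPart{ETag: p2.ETag, PartNumber: aws.Int32(2)})

        // Part 3: server-side copy of the untouched suffix [end, size).
        p3, err := c.UploadPartCopy(ctx, &s3.UploadPartCopyInput{
            Bucket: aws.String(bucket), Key: aws.String(key), UploadId: mpu.UploadId,
            PartNumber:      aws.Int32(3),
            CopySource:      src,
            CopySourceRange: aws.String(fmt.Sprintf("bytes=%d-%d", end, size-1)),
        })
        if err != nil {
            return err
        }
        parts = append(parts, types.CompletedPart{ETag: p3.CopyPartResult.ETag, PartNumber: aws.Int32(3)})

        // Completing the upload swaps the spliced object in under the same key.
        _, err = c.CompleteMultipartUpload(ctx, &s3.CompleteMultipartUploadInput{
            Bucket: aws.String(bucket), Key: aws.String(key), UploadId: mpu.UploadId,
            MultipartUpload: &types.CompletedMultipartUpload{Parts: parts},
        })
        return err
    }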
The performance isn't great because updating the "index" is slow and rate limited, not because the APIs aren't there.
It's actually surprisingly efficient if you batch writes, at the expense of some added latency. The WarpStream team found that batching into chunks of either 4 MB of data or 250 ms was optimal.
The downside is the 250 ms of latency. But then again, a fair number of workloads can deal with 250 ms of latency.
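Roughly, that batching loop looks like the sketch below; the 4 MB / 250 ms thresholds are the ones quoted above, and the records channel plus the flush callback are placeholders for the real ingest path and the actual S3 upload:

    // Sketch: accumulate records and flush whenever the batch reaches ~4 MB or
    // 250 ms have passed since the first record in it, whichever comes first.
    package s3batch

    import (
        "context"
        "time"
    )

    func batchLoop(ctx context.Context, records <-chan []byte, flush func([]byte) error) error {
        const (
            maxBytes = 4 << 20                // ~4 MB per uploaded object
            maxWait  = 250 * time.Millisecond // latency bound per batch
        )
        var buf []byte
        timer := time.NewTimer(maxWait)
        timer.Stop()

        flushBuf := func() error {
            if len(buf) == 0 {
                return nil
            }
            err := flush(buf)
            buf = buf[:0]
            return err
        }

        for {
            select {
            case <-ctx.Done():
                return flushBuf()
            case rec, ok := <-records:
                if !ok {
                    return flushBuf() // input closed: flush whatever is left
                }
                if len(buf) == 0 {
                    timer.Reset(maxWait) // the 250 ms clock starts at the first record
                }
                buf = append(buf, rec...)
                if len(buf) >= maxBytes {
                    if err := flushBuf(); err != nil {
                        return err
                    }
                }
            case <-timer.C:
                if err := flushBuf(); err != nil {
                    return err
                }
            }
        }
    }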
Small objects are very inefficient in S3. Aggregating them into bigger log objects is critical to go from a small system log to a real environment.
Thanks to goofys, I am able to map an existing directory to S3 buckets, so I am never locked into AWS.
(goofys is faster than s3fs because it's not totally POSIX compliant.)
I am a big fan of setting up a full stack on a COMMODITY SERVER and having it just work. You can outsource your video transcoding and storage to Vimeo, AWS, etc., but you don't have to! You can use your own hard drive in your home to run a social network, for instance.
Two independent concurrent writers will constantly conflict on sequence numbers. This will force both to call LastRecord() for practically every append.
Forget about independent writers: Append() isn't thread-safe (as in, there's no locking on .length — maybe not a data race, but two goroutines will still compute the same next sequence number), so you'll get constant conflicts with two goroutines logging to the same writer.
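Within a single process that particular conflict can be papered over by serializing appends. A sketch, assuming a hypothetical wal.Writer whose Append matches the method mentioned above (independent writers in other processes would still need the LastRecord()/CAS retry):

    // Sketch: serialize appends from many goroutines in one process so they do not
    // all race for the same next sequence number. wal.Writer and its Append
    // signature are hypothetical stand-ins for the library under discussion.
    package walutil

    import (
        "context"
        "sync"

        "example.com/wal" // hypothetical import path for the WAL package
    )

    type safeWriter struct {
        mu sync.Mutex
        w  *wal.Writer
    }

    // Append forwards to the underlying writer while holding the mutex, so only
    // one append per process is in flight at a time.
    func (s *safeWriter) Append(ctx context.Context, data []byte) (uint64, error) {
        s.mu.Lock()
        defer s.mu.Unlock()
        return s.w.Append(ctx, data)
    }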
The writeup mentions multiple writers in a couple places but it doesn't really specify how, or where. The writers could be in different processes or on different machines.
Currently, there is a durability guarantee: the call returns only after a successful write to S3. This is close to fsync in a single-node system. I plan to add a relaxed write mode.