It's amazing what you can do with S3. It's one of the best things that AWS has to offer.
I wonder, is there a formal definition for a set of primitives that allow you to build an ACID database? Assume an API of some kind (in this case, S3) that you can interact with - and provides, I don't know, locks, % durability, etc.
What would make you say, 'Having those primitives, I CAN build an ACID database on top of it'?
A consistent log plus an atomic compare-and-swap operation is sufficient for a distributed database, but its performance will be extremely questionable. CAS is always the slow step, and in this case it's pathologically slow. The magic is to do whatever you can to avoid it until absolutely necessary. The availability of consistent, ordered, synchronized timestamps across all nodes is something most distributed databases require as a prerequisite. How you handle violations of that (and to what degree of accuracy you can rely on it) makes a considerable difference.
Depending on how you structure the underlying pages, you’ll get to decide how availability at the log level translates to availability in your user/app-facing interface and whether you will end up sacrificing consistency, availability, or partition tolerance.
Basically, S3 with its recent consistency guarantees and all-new CAS support is sufficient in and of itself. But for anything other than the most basic use case (the least data, the lowest write frequency, etc.) you'll need a considerable amount of magic to make it usable.
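To make that concrete, here is a minimal sketch of the CAS using the conditional PutObject (If-None-Match) that recent versions of the AWS SDK for Go v2 expose. The one-object-per-sequence-number layout, the bucket/key naming, and appendRecord itself are illustrative assumptions, not any particular project's design:

    // Sketch: claim sequence number seq by creating log/<seq> only if it does not
    // already exist. If another writer got there first, S3 rejects the PUT with a
    // 412 PreconditionFailed and the caller retries with the next sequence number.
    // The bucket, key layout, and appendRecord itself are illustrative only.
    package s3log

    import (
        "bytes"
        "context"
        "errors"
        "fmt"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/service/s3"
        "github.com/aws/smithy-go"
    )

    func appendRecord(ctx context.Context, client *s3.Client, bucket string, seq uint64, payload []byte) error {
        key := fmt.Sprintf("log/%020d", seq)
        _, err := client.PutObject(ctx, &s3.PutObjectInput{
            Bucket:      aws.String(bucket),
            Key:         aws.String(key),
            Body:        bytes.NewReader(payload),
            IfNoneMatch: aws.String("*"), // the CAS: succeed only if the key does not exist yet
        })
        var apiErr smithy.APIError
        if errors.As(err, &apiErr) && apiErr.ErrorCode() == "PreconditionFailed" {
            return fmt.Errorf("sequence %d already taken, re-read the tail and retry: %w", seq, err)
        }
        return err
    }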
The most straightforward approach would be to reuse the whole of an existing database but swap out the backend and then tweak the frontend accordingly. SQLite lets you use custom VFS providers (already used to serve fairly efficient SQLite over HTTP without shipping the entire database, though previously not for writes), and with Postgres you can use foreign data wrappers. But in both cases you'll basically have to take out a lock to support writes, either on a page or a row (so you either risk lots of contention or introduce a ton of locking and network latency overhead).
That’s really cool! I’m personally really interested in serverless DB offerings. I’m not sure if yours scales well, but I always seem to hit the limits of a single RDBMS instance at some point as a product matures.
There are plenty of ways to scale out a traditional RDBMS, but serverless offerings make it so much easier.
Wake me up when S3 supports write at offset. Until then it's all gimmicky. Writing small objects and retrieving them later is very inefficient and costly for large data volumes. One can do roll-ups, sure, but with roll-ups there's no longer a way to search through the single rolled-up file. One needs some compute to download the complete file and process it outside of S3.
S3 can at least do a multipart upload where any given part is a copy of a range of an existing object. Then you can finish the upload, overwriting the previous object.
GCS, unfortunately, does not support copying a range. OTOH, it has long supported object append through composition.
The challenge with both offerings is that writes to a single object, and writes clustered around a prefix, are seriously rate limited, and consistency properties mostly apply to single objects.
Yeah, but you cannot multipart a single chunk into a larger complete file; you need all the chunks one way or another, since a multipart upload has to be started and completed with all of them. GCS and Azure support this too. S3 allows a maximum of 10k parts, GCS compose 32 objects, and Azure blob storage, afair, 50k blocks. Both can do an operation similar to what you described for S3, with various alternatives of read at offset + length and rolling those up.
In all cases, you end up rolling up into a new key that isn't available for reads until the roll-up is done. It's kinda useless for heavy-write scenarios.
Compare that to a normal fs operation: a write at an offset into an existing file whose size is smaller than the offset will just extend the file to that offset and continue writing.
You create a multi-part with 3 chunks: the 1st part is a copy range of the prefix, the 2nd part the bit you want to change, and the 3rd a copy range of the suffix?
And yes, all of this is useless for heavy (and esp. concurrent) writes.
We both said the same thing: you kinda can but cannot. Yes, you can replace some part of an existing object, but you cannot resize it, nor can you do anything in parallel with it. So you kinda can but cannot. And this trick will work in GCS and Azure too, except there you have to move the new object to the old key yourself after the roll-up. But why not, while you're already at it.
You can do it “in place” as the target can be the same as the source. And you can definitely resize it, both truncate it and extend it. The only restriction, really, is that all parts except for the last one need to be at least 5MiB.
GCS compose can also have target be one of the source objects, so you can append (and/or prepend) “in place.”
For GCS compose the suffix/prefix need to be separate visible objects (though you can put a lifecycle on them). For multipart, the parts are not really objects ever.
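For concreteness, a rough sketch of that three-part splice with the AWS SDK for Go v2: range-copy the prefix and suffix server-side with UploadPartCopy, upload only the replacement bytes, and complete the upload back onto the same key. The bucket, key, offsets, and the spliceObject helper are placeholders, and every part except the last still has to be at least 5 MiB:

    // Sketch: replace bytes [start, end) of bucket/key "in place" via a multipart
    // upload whose first and last parts are server-side range copies of the object
    // itself. size is the current object size in bytes; the prefix and middle parts
    // must each be at least 5 MiB, while the final (suffix) part can be any size.
    package s3splice

    import (
        "bytes"
        "context"
        "fmt"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/service/s3"
        "github.com/aws/aws-sdk-go-v2/service/s3/types"
    )

    func spliceObject(ctx context.Context, c *s3.Client, bucket, key string, size, start, end int64, middle []byte) error {
        mpu, err := c.CreateMultipartUpload(ctx, &s3.CreateMultipartUploadInput{
            Bucket: aws.String(bucket),
            Key:    aws.String(key),
        })
        if err != nil {
            return err
        }
        src := aws.String(bucket + "/" + key) // copy source is the object being rewritten
        var parts []types.CompletedPart

        // Part 1: server-side copy of the untouched prefix [0, start).
        p1, err := c.UploadPartCopy(ctx, &s3.UploadPartCopyInput{
            Bucket: aws.String(bucket), Key: aws.String(key), UploadId: mpu.UploadId,
            PartNumber:      aws.Int32(1),
            CopySource:      src,
            CopySourceRange: aws.String(fmt.Sprintf("bytes=0-%d", start-1)),
        })
        if err != nil {
            return err
        }
        parts = append(parts, types.CompletedPart{ETag: p1.CopyPartResult.ETag, PartNumber: aws.Int32(1)})

        // Part 2: the replacement bytes, uploaded normally.
        p2, err := c.UploadPart(ctx, &s3.UploadPartInput{
            Bucket: aws.String(bucket), Key: aws.String(key), UploadId: mpu.UploadId,
            PartNumber: aws.Int32(2),
            Body:       bytes.NewReader(middle),
        })
        if err != nil {
            return err
        }
        parts = append(parts, types.CompletedPart{ETag: p2.ETag, PartNumber: aws.Int32(2)})

        // Part 3: server-side copy of the untouched suffix [end, size).
        p3, err := c.UploadPartCopy(ctx, &s3.UploadPartCopyInput{
            Bucket: aws.String(bucket), Key: aws.String(key), UploadId: mpu.UploadId,
            PartNumber:      aws.Int32(3),
            CopySource:      src,
            CopySourceRange: aws.String(fmt.Sprintf("bytes=%d-%d", end, size-1)),
        })
        if err != nil {
            return err
        }
        parts = append(parts, types.CompletedPart{ETag: p3.CopyPartResult.ETag, PartNumber: aws.Int32(3)})

        // Completing the upload swaps the spliced object in under the same key.
        _, err = c.CompleteMultipartUpload(ctx, &s3.CompleteMultipartUploadInput{
            Bucket: aws.String(bucket), Key: aws.String(key), UploadId: mpu.UploadId,
            MultipartUpload: &types.CompletedMultipartUpload{Parts: parts},
        })
        return err
    }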
The performance isn't great because updating the "index" is slow and rate limited, not because the APIs aren't there.
It's actually surprisingly efficient if you batch writes, at the expense of some added latency. The WarpStream team found that batching into chunks of either 4 MB of data or 250 ms was optimal.
The downside is the 250 ms of latency. But then again, a fair number of workloads can deal with 250 ms of latency.
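Roughly, that batching loop looks like the sketch below; the 4 MB / 250 ms thresholds are the ones quoted above, and the records channel plus the flush callback are placeholders for the real ingest path and the actual S3 upload:

    // Sketch: accumulate records and flush whenever the batch reaches ~4 MB or
    // 250 ms have passed since the first record in it, whichever comes first.
    package s3batch

    import (
        "context"
        "time"
    )

    func batchLoop(ctx context.Context, records <-chan []byte, flush func([]byte) error) error {
        const (
            maxBytes = 4 << 20                // ~4 MB per uploaded object
            maxWait  = 250 * time.Millisecond // latency bound per batch
        )
        var buf []byte
        timer := time.NewTimer(maxWait)
        timer.Stop()

        flushBuf := func() error {
            if len(buf) == 0 {
                return nil
            }
            err := flush(buf)
            buf = buf[:0]
            return err
        }

        for {
            select {
            case <-ctx.Done():
                return flushBuf()
            case rec, ok := <-records:
                if !ok {
                    return flushBuf() // input closed: flush whatever is left
                }
                if len(buf) == 0 {
                    timer.Reset(maxWait) // the 250 ms clock starts at the first record
                }
                buf = append(buf, rec...)
                if len(buf) >= maxBytes {
                    if err := flushBuf(); err != nil {
                        return err
                    }
                }
            case <-timer.C:
                if err := flushBuf(); err != nil {
                    return err
                }
            }
        }
    }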
Small objects are very inefficient in S3. Aggregating them into bigger log objects is critical to go from a small system log to a real environment.
Thanks to goofys, I am able to map an existing directory to S3 buckets, so I am never locked into AWS.
(goofys is faster than s3fs because it's not totally POSIX compliant.)
I am a big fan of setting up a full stack on a COMMODITY SERVER and having it just work. You can outsource your video transcoding and storage to Vimeo, AWS, etc., but you don't have to! You can use your own hard drive in your home to run a social network, for instance.
Two independent concurrent writers will constantly conflict on sequence numbers. This will force both to call LastRecord() for practically every append.
Forget about independent writers: Append() isn't thread-safe (as in, there's no locking on .length — maybe not a data race, but two goroutines will still compute the same next sequence number), so you'll get constant conflicts with two goroutines logging to the same writer.
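Within a single process that particular conflict can be papered over by serializing appends. A sketch, assuming a hypothetical wal.Writer whose Append matches the method mentioned above (independent writers in other processes would still need the LastRecord()/CAS retry):

    // Sketch: serialize appends from many goroutines in one process so they do not
    // all race for the same next sequence number. wal.Writer and its Append
    // signature are hypothetical stand-ins for the library under discussion.
    package walutil

    import (
        "context"
        "sync"

        "example.com/wal" // hypothetical import path for the WAL package
    )

    type safeWriter struct {
        mu sync.Mutex
        w  *wal.Writer
    }

    // Append forwards to the underlying writer while holding the mutex, so only
    // one append per process is in flight at a time.
    func (s *safeWriter) Append(ctx context.Context, data []byte) (uint64, error) {
        s.mu.Lock()
        defer s.mu.Unlock()
        return s.w.Append(ctx, data)
    }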
The writeup mentions multiple writers in a couple places but it doesn't really specify how, or where. The writers could be in different processes or on different machines.
Currently, there is a durability guarantee: the call returns only after a successful write to S3. This is close to fsync in a single-node system. I plan to add a relaxed write mode.