> JuiceFS, written in Go, can manage tens of billions of files in a single namespace.
At that scale, I care about integrity. Can someone working in this space please have a real integrity story as part of the offering? Give each object (object, version pair, perhaps) a cryptographic hash of the contents, and make that hash be part of the inventory. Allow the entire bucket to opt in to mandatory hashing. Let me ask the system to do a scrub in which it verifies those cryptographic hashes.
If this blows up metadata to 164 bytes per object, so be it. But the hash can probably get away with being stored with data, not metadata, as long as there is a mechanism to inventory those hashes. Keeping them in memory doesn’t seem necessary.
Even S3 has only desultory support for this. A lot of competitors have nothing of the sort.
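To make the ask concrete, the scrub I have in mind is nothing fancier than re-reading each object and comparing it against the hash recorded in the inventory. A rough Go sketch (the InventoryEntry record and the open callback are made up for illustration):

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "io"
    )

    // InventoryEntry is a hypothetical record: one per (object, version) pair.
    type InventoryEntry struct {
        Key, Version string
        SHA256       string // hex hash recorded at write time
    }

    // scrub re-reads every object and verifies it against the recorded hash.
    // open is whatever returns the object bytes (S3 GET, local read, ...).
    func scrub(inv []InventoryEntry, open func(key, version string) (io.ReadCloser, error)) []string {
        var corrupted []string
        for _, e := range inv {
            r, err := open(e.Key, e.Version)
            if err != nil {
                corrupted = append(corrupted, e.Key+" (unreadable)")
                continue
            }
            h := sha256.New()
            _, err = io.Copy(h, r)
            r.Close()
            if err != nil || hex.EncodeToString(h.Sum(nil)) != e.SHA256 {
                corrupted = append(corrupted, e.Key)
            }
        }
        return corrupted
    }

    func main() {
        fmt.Println("corrupted:", scrub(nil, nil)) // wire up a real inventory + reader here
    }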
JuiceFS relies on the object store to provide integrity for data. Beyond that, JuiceFS stores the checksum of each object as a tag in S3 and verifies it when downloading objects.
Inside the metadata service, it uses a Merkle tree (hash of hashes) to verify the integrity of the whole namespace (including the ids of data blocks) across Raft replicas. Once we store the hash (4 bytes) of each object in the metadata, that should provide integrity for the whole namespace.
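For anyone unfamiliar with the hash-of-hashes idea: each replica folds its per-entry hashes into a single root, and two replicas agree on the whole namespace iff the roots match. A toy sketch of that fold in Go (illustrative only, not JuiceFS's actual tree layout):

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // merkleRoot folds a list of leaf hashes (e.g. one per inode/block id)
    // into a single digest; an odd node at a level is hashed alone.
    func merkleRoot(leaves [][]byte) []byte {
        if len(leaves) == 0 {
            return nil
        }
        level := leaves
        for len(level) > 1 {
            var next [][]byte
            for i := 0; i < len(level); i += 2 {
                h := sha256.New()
                h.Write(level[i])
                if i+1 < len(level) {
                    h.Write(level[i+1])
                }
                next = append(next, h.Sum(nil))
            }
            level = next
        }
        return level[0]
    }

    func main() {
        a := sha256.Sum256([]byte("inode 1"))
        b := sha256.Sum256([]byte("inode 2"))
        fmt.Printf("%x\n", merkleRoot([][]byte{a[:], b[:]}))
    }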
Surely you mean that the FS should calculate the hash on file creation/update, not take some random value from the user. But I agree that an FS that maintains a file-content hash should allow clients to query it.
No, the FS should verify the hash on creation/update. Otherwise corruption during creation/update would just cause the hash to match the corrupted data.
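A rough sketch of what that looks like (illustrative, not any particular FS's code path): hash the bytes as they are actually persisted and compare against the checksum the client computed before sending, so corruption in flight gets rejected instead of blessed. storeVerified and its plumbing here are hypothetical:

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "io"
        "strings"
    )

    // storeVerified persists src into dst while hashing the bytes it actually
    // writes, then rejects the write if the result doesn't match the hash the
    // client computed before sending.
    func storeVerified(dst io.Writer, src io.Reader, clientSHA256 string) error {
        h := sha256.New()
        if _, err := io.Copy(io.MultiWriter(dst, h), src); err != nil {
            return err
        }
        if got := hex.EncodeToString(h.Sum(nil)); got != clientSHA256 {
            return fmt.Errorf("integrity check failed: got %s, want %s", got, clientSHA256)
        }
        return nil
    }

    func main() {
        data := "file contents"
        want := sha256.Sum256([]byte(data))
        var sink strings.Builder
        err := storeVerified(&sink, strings.NewReader(data), hex.EncodeToString(want[:]))
        fmt.Println("stored, err =", err)
    }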
The Quobyte DCFS does end-to-end CRC32 for each 4k block of data. All metadata and communication is also CRC-protected, although on other frame boundaries.
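For reference, per-block CRC is cheap to roll yourself too; something like this sketch (not Quobyte's actual wire format) checksums each 4 KiB block independently, so a flipped bit pinpoints the damaged block:

    package main

    import (
        "fmt"
        "hash/crc32"
    )

    const blockSize = 4096

    // blockCRCs returns one CRC32 (Castagnoli) per 4 KiB block of data.
    func blockCRCs(data []byte) []uint32 {
        table := crc32.MakeTable(crc32.Castagnoli)
        var sums []uint32
        for off := 0; off < len(data); off += blockSize {
            end := off + blockSize
            if end > len(data) {
                end = len(data)
            }
            sums = append(sums, crc32.Checksum(data[off:end], table))
        }
        return sums
    }

    func main() {
        data := make([]byte, 3*blockSize+100)
        fmt.Println(blockCRCs(data)) // 4 checksums: 3 full blocks + 1 tail
    }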
All distributed networks are vulnerable to Sybil attacks (unless you can ensure provenance somehow, out-of-band), but unless you can break the hash function, all that gets you with BitTorrent is denial of service (and traffic interception, I suppose, but that should already be part of your threat model).
There is absolutely a risk of downloading malicious data. But the protocol will reject it for failing integrity checks. That doesn't mean the software will reject it.
Are you saying mainstream torrent clients don't check the hash? As far as I know, not only do they, but they ban peers who have sent them wrong data more than once. So you could DoS them for a bit with lots of peers sending bad data, but you need a lot of IPs to do that because you'll quickly get all of them banned. And unless you are doing this through residential proxies, people will learn your ranges and block you by default.
Maybe there's a DoS you could do with uTP by spoofing someone else's IP and getting them to ban a real peer, but you'd presumably have to get in between their block requests and reply with bad ones, which realistically means you are a MitM, so you could DoS them more directly by just dropping their traffic.
Or if you mean more generally that a malicious packet could reach a client and exploit a memory bug or something, that applies to literally anything on a network.
Suppose you have a torrent client that saves chunks to the filesystem before performing integrity checks. Suppose also that you have an antivirus program that scans every newly-created file for malware… and someone sent you 42.zip. Sure, the torrent client will reject it later, but the damage has already been done.
This specific scenario is unlikely (most antivirus programs can cope with zip bombs these days), but computers are complex. Other edge cases might exist. Torrenting is safer than downloading something from your average modern website, but in practice it's nowhere near as safe as the theoretical limit.
RocksDB supports hashing at multiple levels (key, value, files) because Meta also realized the importance of integrity. It also supports verifying them in bulk.
Presumably filesystems built over rocksdb also support this.
You're implementing a slab allocator. That's exactly what any slab allocator does, including gmalloc, tcmalloc, jemalloc and SLUB (Linux kernel). But there is probably much, much more you can do on a modern machine to squeeze out even more performance. Some of the things that come to mind are: reducing TLB stalls with hugepage awareness [1], and reducing false cache sharing on SMPs [2].
For the compression part, have you thought about OS-managed memory compression [3][4]?
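For readers who haven't written one: the core of a slab allocator is just carving a contiguous arena into fixed-size slots and recycling them through a free list. A toy Go sketch (no hugepage or NUMA awareness, slot size picked arbitrarily):

    package main

    import "fmt"

    const slotSize = 64 // bytes per object slot

    // Slab hands out fixed-size byte slices from one contiguous arena and
    // recycles freed slots through a free list, avoiding per-object GC work.
    type Slab struct {
        arena []byte
        free  []int // indices of available slots
    }

    func NewSlab(slots int) *Slab {
        s := &Slab{arena: make([]byte, slots*slotSize)}
        for i := slots - 1; i >= 0; i-- {
            s.free = append(s.free, i)
        }
        return s
    }

    // Alloc returns a slot index plus a capacity-capped view of its bytes.
    func (s *Slab) Alloc() (int, []byte) {
        if len(s.free) == 0 {
            return -1, nil // caller grows the slab or falls back to the heap
        }
        i := s.free[len(s.free)-1]
        s.free = s.free[:len(s.free)-1]
        return i, s.arena[i*slotSize : (i+1)*slotSize : (i+1)*slotSize]
    }

    func (s *Slab) Free(i int) { s.free = append(s.free, i) }

    func main() {
        s := NewSlab(1024)
        slot, buf := s.Alloc()
        copy(buf, "hello")
        s.Free(slot)
        fmt.Println(slot, len(buf)) // 0 64
    }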
I'd like to learn more about JuiceFS, but from their architecture diagrams I'm struggling to see what benefit they provide if they're a layer over blob store systems like Ceph, MinIO, etc.
It looks as though you need to set up the underlying storage engine that has its own recordkeeping (one point of failure), then layer JuiceFS on top — another point of failure, and I don't know what it gives you.
It's good that it's fast, but if it's pointing to another blob store you'd... expect it to be fast? Is this only needed if you want properly faked file system primitives over blob stores, if you can't use blob stores directly? Definitely need to spend some time reading.
S3 is not designed for intensive metadata operations like listing, renaming, etc. For those operations you need a somewhat POSIX-compliant system. For example, if you want to train on the ImageNet dataset, the “canonical” way [1] is to extract the images and organize them into folders, class by class. The whole dataset is discovered by directory listing. This is where JuiceFS shines.
Of course, if the dataset is really massive, you will mostly end-up with in-house solutions.
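To make the listing point concrete: the usual ImageNet-style loader just walks one sub-directory per class, which is a cheap readdir on a POSIX filesystem and a slow, paginated prefix listing on raw S3. A sketch (the dataset root path is hypothetical):

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // discoverClasses maps each class (one sub-directory per label) to its
    // image paths -- exactly the directory-listing pattern that is cheap on
    // a POSIX filesystem and awkward on bare S3.
    func discoverClasses(root string) (map[string][]string, error) {
        classes := map[string][]string{}
        entries, err := os.ReadDir(root)
        if err != nil {
            return nil, err
        }
        for _, class := range entries {
            if !class.IsDir() {
                continue
            }
            images, err := os.ReadDir(filepath.Join(root, class.Name()))
            if err != nil {
                return nil, err
            }
            for _, img := range images {
                classes[class.Name()] = append(classes[class.Name()],
                    filepath.Join(root, class.Name(), img.Name()))
            }
        }
        return classes, nil
    }

    func main() {
        classes, err := discoverClasses("/data/imagenet/train") // hypothetical mount point
        fmt.Println(len(classes), err)
    }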
S3's metadata speed is horrifically slow. Part of the reason fake filesystems on S3 also perform badly is that checking anything to do with where a file is located takes >250ms.
Now if you think about how many files are in a git repo, and how many are touched when you commit, you can see the problem.
I'm not sure JuiceFS is the answer, given that the metadata is stored in memory. But it is an answer.
Personally, you're better off with their managed Lustre offering.
S3, as you know, is a key-based object store, with a hack that allows directories ('/' is just part of the key name, and is filtered out). So it's less of a deficiency and more of a tradeoff they made to get performance/uptime. Enumerating the keys in an S3 bucket is pretty slow, so it makes any kind of listing operation slow for a follow-on system.
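For the curious, a "directory listing" on S3 is literally a prefix + delimiter filter over the flat keyspace, paginated 1000 keys at a time; this is roughly what every filesystem shim has to do for a single readdir (sketch using aws-sdk-go-v2; the bucket and prefix are made up):

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    func main() {
        cfg, err := config.LoadDefaultConfig(context.TODO())
        if err != nil {
            log.Fatal(err)
        }
        client := s3.NewFromConfig(cfg)

        // One "readdir" of s3://my-bucket/photos/ -- a filtered listing of a
        // flat keyspace, fetched page by page.
        p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
            Bucket:    aws.String("my-bucket"), // hypothetical bucket
            Prefix:    aws.String("photos/"),
            Delimiter: aws.String("/"),
        })
        for p.HasMorePages() {
            page, err := p.NextPage(context.TODO())
            if err != nil {
                log.Fatal(err)
            }
            for _, dir := range page.CommonPrefixes { // "subdirectories"
                fmt.Println(*dir.Prefix)
            }
            for _, obj := range page.Contents { // "files"
                fmt.Println(*obj.Key)
            }
        }
    }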
Having a metadata cache is a sensible option, especially if you have complete control of a bucket and are able to be canonical (i.e. everything talks to your DB to get key names).
But! What you can't do is write to the middle of a file; you need to upload the whole thing again. This is not a problem for a lot of workflows, but for POSIX filesystems that's going to cause problems. You can seek to the middle of a file and write to it, and the client will upload it to S3. What happens if another system modifies that file while you are uploading? Sounds like a locking nightmare.
JuiceFS is similar to HDFS/CephFS/Lustre, so it MUST have a component to manage metadata, similar to the NameNode in HDFS or the MDS in CephFS; this point of failure is the problem we have to address.
The underlying blob store systems are similar to the DataNodes or OSDs in other distributed file systems. They could be a little slower because of the middle layers, but the overall performance is determined by the disks.
So we can expect performance similar to HDFS/CephFS, and the benchmark results confirm that.
> Is this only needed if you want properly faked file system primitives over blob stores, if you can't use blob stores directly?
Semiconductor EDA comes to mind, where users are at the mercy of their tool vendors and they expect something that really hasn't evolved much past an office network of Unix workstations from 1999. Object storage is almost completely alien to the tooling, and despite significant file & block storage costs there is little interest from EDA tool vendors to adapt their tools to object storage. This is a challenge for semiconductor firms operating in or moving to a cloud environment.
S3FS/rclone can of course act as a shim, but they are very slow when it comes to metadata operations in a typical shell. But if you move your metadata away from the distant object store and closer to your compute environment, things actually start becoming usable - this is the case with JuiceFS. Of course there are also tiered storage systems like Weka that would have better overall performance, but they are more complicated to set up and more expensive to operate than JuiceFS.
I’ve been using community edition JuiceFS with Percona XtraDB cluster for Metadata and MinIO multi-node for a couple of years for large archival and backup data storage. That setup has worked really well.
2) Never use distributed filesystems unless you need a global namespace.
3) Distributed filesystems mean that you have a single point of failure; it's just a more complex and subtle form of failure.
4) You have to think about how partitioning works, and how you want to recover.
Sharding standard NFS fileservers is by far the easiest way to scale, have redundancy, and partition your failure zones. Using autofs on a custom directory, you can mount servers on demand: /mnt/0, /mnt/1, /mnt/2, etc.
You then have the ability to choose your sharding scheme for speed, redundancy or some other constraint.
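The sharding scheme itself can be as small as a hash of the path modulo the number of mounts. A sketch (the shard count and hash choice are arbitrary; pick whatever matches your layout):

    package main

    import (
        "fmt"
        "hash/fnv"
        "path/filepath"
    )

    const numShards = 8 // /mnt/0 ... /mnt/7

    // shardFor maps a logical path to one of the autofs-mounted NFS servers.
    // Hashing a stable key (e.g. per-project or per-user directory) keeps
    // related files on the same server.
    func shardFor(path string) string {
        h := fnv.New32a()
        h.Write([]byte(path))
        return filepath.Join("/mnt", fmt.Sprint(h.Sum32()%numShards))
    }

    func main() {
        fmt.Println(shardFor("projects/alpha")) // e.g. /mnt/3
    }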
Nowadays the kind of performance that distributed systems used to enable is commodity. A single 4U storage server can saturate a 100-gig network link with random disk IO.
We've used some distributed storage systems (Longhorn, OpenEBS) for Kubernetes and had no luck.
I agree with you on ALL the things you've listed. When we started using Ceph for the first time (at that time I was a user, not a manager), the whole thing was a total mess. Everything was left at defaults, using all the JBODs on each node. (And I started running into corrupted WALs and OSDs out of nowhere.)
The handful of documentation that existed was very helpful for cleaning things up and getting it to run correctly.
Maybe, if I could go back to the starting point, I'd say "move to the cloud" or "do nothing", or I'd follow your guidance.
I should have been more gentle in my statement, for that I apologise.
If you do start again _and_ move to the cloud, AWS's Lustre offering (FSx) is good enough (if expensive), or EFS, although that has slower metadata performance. It's good enough at RW IOPS though, assuming you monitor it correctly.
I feel your pain, and sympathise deeply. Best of luck!
Actually, the average here just implies that, tested across different configurations and scenarios, it comes out to about 100 bytes on average, net, per file. Adding "per file" does have a benefit.
As a computational biologist, I have a lot of embarrassingly parallel problems, so the lightweight concurrency of Go is really easy. In addition, my algorithms and data perform comfortably in GC-land 99% of the time, reducing my mental overhead (for science!). I had to reach for unsafe.Pointer() only once and that was to use Other People's Code (TM).
I consistently find my code runs 20-80x faster than the equivalent in Python and I'm not about to cram C++ or Rust into my head along with all the biology I've memorized for ~2x performance improvement and >5x learning curve. On HN, I shouldn't need to explain why I don't use Java despite a similar set of tradeoffs to Go (other than memory use, historically).
They're only using 36% of that (so they have slack capacity). It's in-memory, which has some perf wins (though I'd question if distributed systems are worth it when you can get a PCIe splitter and just mmap some 4TB SSDs on whatever cheap desktop computer somebody has lying around).
100 bytes per file is less than you would expect to support all that data if you coded it "naturally" without an eye toward space efficiency. You'll need 16-32 bytes or so just to handle a tree structure with internal string indices to be able to support finding any given file's location quickly. Toss in created times, modified times, permissions, attributes, the fact that they have a chunking structure to support both random seek and efficient/easy allocation, .... It adds up to more than 100B if you're not pretty careful. They hit that 100B mark.
Is that impressive? I dunno. Their competitors didn't do it, and they did, so that's something.
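As a rough sanity check on that arithmetic, even a deliberately spartan per-file record lands near that budget before you add the file name, xattrs, or any index structures. A sketch (not JuiceFS's actual layout):

    package main

    import (
        "fmt"
        "unsafe"
    )

    // inode is a deliberately minimal per-file record; real metadata engines
    // add a name, extended attributes, and per-chunk references on top.
    type inode struct {
        id       uint64 // inode number
        parent   uint64 // parent directory
        size     uint64
        atime    int64
        mtime    int64
        ctime    int64
        mode     uint32 // type + permissions
        uid, gid uint32
        nlink    uint32
        chunks   []uint64 // ids of data chunks (slice header alone is 24 bytes)
    }

    func main() {
        fmt.Println(unsafe.Sizeof(inode{})) // ~88 bytes on a 64-bit build: most of the 100B budget
    }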
I'll never understand the "we are open source but lock $cool_but_essential_features behind an enterprise license."
Unless you plan to charge less than what a single developer makes in a quarter or two, any of your potential customers could at any point patch in the feature and open source the changes for everyone.
(Traefik and Riak are the two biggest offenders I've run into)
Those patches are unlikely to be officially supported, though. I really support this mode of monetization, because community/free users get something, well, free, while others get something more for a bit more. It’s [hopefully] sustainable for the engineers and team behind the product, too!
While the underlying idea that selling services on open source products is not sustainable from a purely business perspective may be true, the linked article is not particularly convincing to me: it doesn't provide strong evidence and it's based on an embarrassingly small sample size. Besides, maybe we should also start to consider other metrics when we evaluate a business's success, not just economic profit: there are externalized societal benefits (and damages) that are very important and nonetheless poorly accounted for.
It only looks at one open source company, which could not have closed-sourced its product (because it was not the original developer), so it could only exist at all as an (at least mostly) open source business, and compares it to companies in entirely different lines of business.
It also became badly outdated a few years after it was written, when said business was acquired for several times its market cap at the time the article was written.
If you want to compare it to businesses offering a similar product into a similar market, you should compare it to other OS vendors founded in the 90s targeting desktops and servers. Any more successful proprietary examples around?
Most of all the article only looks at one, open source business. Not exactly a meaningful sample.
It is probably very difficult to rely on that model if a single company is the original developer and the main developer - so you have all the costs but share the benefits. It is not true if you fork code that already exists, or if you have a bazaar development model.
In theory you're right, but in most companies the friction of spinning up a team to manage the custom patch set of each piece of open-core software they use would dominate the actual effort needed to implement the features. It's just not in most companies' DNA to operate this way.
(Also, a lot of SaaS is in fact cheaper than what a dev makes in half a year).
Competitive in this space is also SeaweedFS, which I've taken out for a local spin. It feels kind of like "memcached-for-fs/s3"... you add 30GB chunks of storage and it coalesces them for you without much ceremony. ~20 bytes storage overhead per file and performance seems pretty decent.
High performance systems in any language require these optimisations. Go provides a solid foundation to create the platform before you start optimising. Not saying it's the best or only choice, but it's not a bad choice.
High performance systems are where these kinds of optimisations make the most difference.
If most of your overhead is due to the low performance of the language/runtime you're using, these kinds of optimisations won't make as much of a difference. That is, if you can even implement them at all. I mean, good luck trying to use memory pools and manual memory management in Python or Javascript.
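For what it's worth, Go does give you the hooks for this: sync.Pool plus preallocated buffers gets you most of the "manual" memory management benefit without leaving the language. A minimal sketch:

    package main

    import (
        "fmt"
        "sync"
    )

    // bufPool recycles 64 KiB scratch buffers so hot paths stop allocating
    // (and the GC stops scanning) per request.
    var bufPool = sync.Pool{
        New: func() any { return make([]byte, 64*1024) },
    }

    func handle(payload []byte) int {
        buf := bufPool.Get().([]byte)
        defer bufPool.Put(buf)
        n := copy(buf, payload) // do the real work in the pooled buffer
        return n
    }

    func main() {
        fmt.Println(handle([]byte("hello")))
    }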