> JuiceFS, written in Go, can manage tens of billions of files in a single namespace.
At that scale, I care about integrity. Can someone working in this space please have a real integrity story as part of the offering? Give each object (object, version pair, perhaps) a cryptographic hash of the contents, and make that hash be part of the inventory. Allow the entire bucket to opt in to mandatory hashing. Let me ask the system to do a scrub in which it verifies those cryptographic hashes.
If this blows up metadata to 164 bytes per object, so be it. But the hash can probably get away with being stored with data, not metadata, as long as there is a mechanism to inventory those hashes. Keeping them in memory doesn’t seem necessary.
Even S3 has only desultory support for this. A lot of competitors have nothing of the sort.
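To make the ask concrete, the scrub I have in mind is nothing fancier than re-reading each object and comparing it against the hash recorded in the inventory. A rough Go sketch (the InventoryEntry record and the open callback are made up for illustration):

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "io"
    )

    // InventoryEntry is a hypothetical record: one per (object, version) pair.
    type InventoryEntry struct {
        Key, Version string
        SHA256       string // hex hash recorded at write time
    }

    // scrub re-reads every object and verifies it against the recorded hash.
    // open is whatever returns the object bytes (S3 GET, local read, ...).
    func scrub(inv []InventoryEntry, open func(key, version string) (io.ReadCloser, error)) []string {
        var corrupted []string
        for _, e := range inv {
            r, err := open(e.Key, e.Version)
            if err != nil {
                corrupted = append(corrupted, e.Key+" (unreadable)")
                continue
            }
            h := sha256.New()
            _, err = io.Copy(h, r)
            r.Close()
            if err != nil || hex.EncodeToString(h.Sum(nil)) != e.SHA256 {
                corrupted = append(corrupted, e.Key)
            }
        }
        return corrupted
    }

    func main() {
        fmt.Println("corrupted:", scrub(nil, nil)) // wire up a real inventory + reader here
    }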
JuiceFS relies on the object store to provide integrity for data. Beyond that, JuiceFS stores the checksum of each object as a tag in S3 and verifies it when downloading objects.
Inside the metadata service, it uses a Merkle tree (hash of hashes) to verify the integrity of the whole namespace (including the ids of data blocks) across Raft replicas. Once we store the hash (4 bytes) of each object in the metadata, that should provide integrity for the whole namespace.
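For anyone unfamiliar with the hash-of-hashes idea: each replica folds its per-entry hashes into a single root, and two replicas agree on the whole namespace iff the roots match. A toy sketch of that fold in Go (illustrative only, not JuiceFS's actual tree layout):

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // merkleRoot folds a list of leaf hashes (e.g. one per inode/block id)
    // into a single digest; an odd node at a level is hashed alone.
    func merkleRoot(leaves [][]byte) []byte {
        if len(leaves) == 0 {
            return nil
        }
        level := leaves
        for len(level) > 1 {
            var next [][]byte
            for i := 0; i < len(level); i += 2 {
                h := sha256.New()
                h.Write(level[i])
                if i+1 < len(level) {
                    h.Write(level[i+1])
                }
                next = append(next, h.Sum(nil))
            }
            level = next
        }
        return level[0]
    }

    func main() {
        a := sha256.Sum256([]byte("inode 1"))
        b := sha256.Sum256([]byte("inode 2"))
        fmt.Printf("%x\n", merkleRoot([][]byte{a[:], b[:]}))
    }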
Surely you mean that the FS should calculate the hash on file creation/update, not take some random value from the user. But I agree that an FS that maintains a file-content hash should allow clients to query it.
No, the FS should verify the hash on creation/update. Otherwise corruption during creation/update would just cause the hash to match the corrupted data.
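A rough sketch of what that looks like (illustrative, not any particular FS's code path): hash the bytes as they are actually persisted and compare against the checksum the client computed before sending, so corruption in flight gets rejected instead of blessed. storeVerified and its plumbing here are hypothetical:

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "io"
        "strings"
    )

    // storeVerified persists src into dst while hashing the bytes it actually
    // writes, then rejects the write if the result doesn't match the hash the
    // client computed before sending.
    func storeVerified(dst io.Writer, src io.Reader, clientSHA256 string) error {
        h := sha256.New()
        if _, err := io.Copy(io.MultiWriter(dst, h), src); err != nil {
            return err
        }
        if got := hex.EncodeToString(h.Sum(nil)); got != clientSHA256 {
            return fmt.Errorf("integrity check failed: got %s, want %s", got, clientSHA256)
        }
        return nil
    }

    func main() {
        data := "file contents"
        want := sha256.Sum256([]byte(data))
        var sink strings.Builder
        err := storeVerified(&sink, strings.NewReader(data), hex.EncodeToString(want[:]))
        fmt.Println("stored, err =", err)
    }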
The Quobyte DCFS does end-to-end CRC32 for each 4k block of data. All metadata and communication is also CRC-protected, although on other frame boundaries.
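For reference, per-block CRC is cheap to roll yourself too; something like this sketch (not Quobyte's actual wire format) checksums each 4 KiB block independently, so a flipped bit pinpoints the damaged block:

    package main

    import (
        "fmt"
        "hash/crc32"
    )

    const blockSize = 4096

    // blockCRCs returns one CRC32 (Castagnoli) per 4 KiB block of data.
    func blockCRCs(data []byte) []uint32 {
        table := crc32.MakeTable(crc32.Castagnoli)
        var sums []uint32
        for off := 0; off < len(data); off += blockSize {
            end := off + blockSize
            if end > len(data) {
                end = len(data)
            }
            sums = append(sums, crc32.Checksum(data[off:end], table))
        }
        return sums
    }

    func main() {
        data := make([]byte, 3*blockSize+100)
        fmt.Println(blockCRCs(data)) // 4 checksums: 3 full blocks + 1 tail
    }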
All distributed networks are vulnerable to Sybil attacks (unless you can ensure provenance somehow, out-of-band), but unless you can break the hash function, all that gets you with BitTorrent is denial of service (and traffic interception, I suppose, but that should already be part of your threat model).
There is absolutely a risk of downloading malicious data. But the protocol will reject it for failing integrity checks. That doesn't mean the software will reject it.
Are you saying mainstream torrent clients don't check the hash? As far as I know, not only do they, but they ban peers who have sent them wrong data more than once. So you could DoS them for a bit with lots of peers sending bad data, but you need a lot of IPs to do that because you'll quickly get all of them banned. And unless you are doing this through residential proxies, people will learn your ranges and block you by default.
Maybe there's a DoS you could do with uTP by spoofing someone else's IP and getting them to ban a real peer, but you'd presumably have to get in between their block requests and reply with bad ones, which realistically means you are a MitM, so you could DoS them more directly by just dropping their traffic.
Or if you mean more generally that a malicious packet could reach a client and exploit a memory bug or something, that applies to literally anything on a network.
Suppose you have a torrent client that saves chunks to the filesystem before performing integrity checks. Suppose also that you have an antivirus program that scans every newly-created file for malware… and someone sent you 42.zip. Sure, the torrent client will reject it later, but the damage has already been done.
This specific scenario is unlikely (most antivirus programs can cope with zip bombs these days), but computers are complex. Other edge cases might exist. Torrenting is safer than downloading something from your average modern website, but in practice it's nowhere near as safe as the theoretical limit.
RocksDB supports hashing at multiple levels (key, value, files) because Meta also realized the importance of integrity. It also supports verifying them in bulk.
Presumably filesystems built over rocksdb also support this.
You're implementing a slab allocator. That's exactly what any slab allocator does, including gmalloc, tcmalloc, jemalloc and SLUB (Linux kernel). But there is probably much, much more you can do on a modern machine to squeeze out even more performance. Some of the things that come to mind are: reducing TLB stalls with hugepage awareness [1], and reducing false cache sharing on SMPs [2].
For the compression part, have you thought about OS-managed memory compression [3][4]?
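For readers who haven't written one: the core of a slab allocator is just carving a contiguous arena into fixed-size slots and recycling them through a free list. A toy Go sketch (no hugepage or NUMA awareness, slot size picked arbitrarily):

    package main

    import "fmt"

    const slotSize = 64 // bytes per object slot

    // Slab hands out fixed-size byte slices from one contiguous arena and
    // recycles freed slots through a free list, avoiding per-object GC work.
    type Slab struct {
        arena []byte
        free  []int // indices of available slots
    }

    func NewSlab(slots int) *Slab {
        s := &Slab{arena: make([]byte, slots*slotSize)}
        for i := slots - 1; i >= 0; i-- {
            s.free = append(s.free, i)
        }
        return s
    }

    // Alloc returns a slot index plus a capacity-capped view of its bytes.
    func (s *Slab) Alloc() (int, []byte) {
        if len(s.free) == 0 {
            return -1, nil // caller grows the slab or falls back to the heap
        }
        i := s.free[len(s.free)-1]
        s.free = s.free[:len(s.free)-1]
        return i, s.arena[i*slotSize : (i+1)*slotSize : (i+1)*slotSize]
    }

    func (s *Slab) Free(i int) { s.free = append(s.free, i) }

    func main() {
        s := NewSlab(1024)
        slot, buf := s.Alloc()
        copy(buf, "hello")
        s.Free(slot)
        fmt.Println(slot, len(buf)) // 0 64
    }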
I'd like to learn more about JuiceFS, but from their architecture diagrams I'm struggling to see what benefit they provide if they're a layer over blob store systems like Ceph, MinIO, etc.
It looks as though you need to set up the underlying storage engine that has its own recordkeeping (one point of failure), then layer JuiceFS on top — another point of failure, and I don't know what it gives you.
It's good that it's fast, but if it's pointing to another blob store you'd... expect it to be fast? Is this only needed if you want properly faked file system primitives over blob stores, if you can't use blob stores directly? Definitely need to spend some time reading.
S3 is not designed for intensive metadata operations like listing, renaming, etc. For those operations you need a somewhat POSIX-compliant system. For example, if you want to train on the ImageNet dataset, the “canonical” way [1] is to extract the images and organize them into folders, class by class. The whole dataset is discovered by directory listing. This is where JuiceFS shines.
Of course, if the dataset is really massive, you will mostly end-up with in-house solutions.
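To make the listing point concrete: the usual ImageNet-style loader just walks one sub-directory per class, which is a cheap readdir on a POSIX filesystem and a slow, paginated prefix listing on raw S3. A sketch (the dataset root path is hypothetical):

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // discoverClasses maps each class (one sub-directory per label) to its
    // image paths -- exactly the directory-listing pattern that is cheap on
    // a POSIX filesystem and awkward on bare S3.
    func discoverClasses(root string) (map[string][]string, error) {
        classes := map[string][]string{}
        entries, err := os.ReadDir(root)
        if err != nil {
            return nil, err
        }
        for _, class := range entries {
            if !class.IsDir() {
                continue
            }
            images, err := os.ReadDir(filepath.Join(root, class.Name()))
            if err != nil {
                return nil, err
            }
            for _, img := range images {
                classes[class.Name()] = append(classes[class.Name()],
                    filepath.Join(root, class.Name(), img.Name()))
            }
        }
        return classes, nil
    }

    func main() {
        classes, err := discoverClasses("/data/imagenet/train") // hypothetical mount point
        fmt.Println(len(classes), err)
    }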
S3's metadata speed is horrifically slow. Part of the reason fake filesystems on S3 also perform badly is that checking anything to do with where a file is located takes >250ms.
Now if you think about how many files are in a git repo, and how many are touched when you commit, you can see the problem.
I'm not sure JuiceFS is the answer, given that the metadata is stored in memory. But it is an answer.
Personally, you're better off with their managed Lustre offering.
S3, as you know, is a key-based object store, with a hack that allows directories ('/' is just part of the key name, and is filtered out). So it's less of a deficiency and more of a tradeoff they made to get performance/uptime. Enumerating the keys in an S3 bucket is pretty slow, so it makes any kind of listing operation slow for a follow-on system.
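For the curious, a "directory listing" on S3 is literally a prefix + delimiter filter over the flat keyspace, paginated 1000 keys at a time; this is roughly what every filesystem shim has to do for a single readdir (sketch using aws-sdk-go-v2; the bucket and prefix are made up):

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    func main() {
        cfg, err := config.LoadDefaultConfig(context.TODO())
        if err != nil {
            log.Fatal(err)
        }
        client := s3.NewFromConfig(cfg)

        // One "readdir" of s3://my-bucket/photos/ -- a filtered listing of a
        // flat keyspace, fetched page by page.
        p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
            Bucket:    aws.String("my-bucket"), // hypothetical bucket
            Prefix:    aws.String("photos/"),
            Delimiter: aws.String("/"),
        })
        for p.HasMorePages() {
            page, err := p.NextPage(context.TODO())
            if err != nil {
                log.Fatal(err)
            }
            for _, dir := range page.CommonPrefixes { // "subdirectories"
                fmt.Println(*dir.Prefix)
            }
            for _, obj := range page.Contents { // "files"
                fmt.Println(*obj.Key)
            }
        }
    }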
Having a metadata cache is a sensible option, especially if you have complete control of a bucket and are able to be canonical (i.e. everything talks to your DB to get key names).
But! What you can't do is write to the middle of a file; you need to upload the whole thing again. This is not a problem for a lot of workflows, but for POSIX filesystems that's going to cause problems. You can seek to the middle of a file and write to it, and the client will upload it to S3. What happens if another system modifies that file while you are uploading? Sounds like a locking nightmare.
JuiceFS is similar to HDFS/CephFS/Lustre, so it MUST have a component to manage metadata, similar to the NameNode in HDFS or the MDS in CephFS; this point of failure is the problem we have to address.
The underlying blob store systems are similar to the DataNodes or OSDs in other distributed file systems. They could be a little slower because of the middle layers, but the overall performance is determined by the disks.
So we can expect performance similar to HDFS/CephFS, and the benchmark results confirm that.
> Is this only needed if you want properly faked file system primitives over blob stores, if you can't use blob stores directly?
Semiconductor EDA comes to mind, where users are at the mercy of their tool vendors and they expect something that really hasn't evolved much past an office network of Unix workstations from 1999. Object storage is almost completely alien to the tooling, and despite significant file & block storage costs there is little interest from EDA tool vendors to adapt their tools to object storage. This is a challenge for semiconductor firms operating in or moving to a cloud environment.
S3FS/rclone can of course act as a shim, but they are very slow when it comes to metadata operations in a typical shell. But if you move your metadata away from the distant object store and closer to your compute environment, things actually start becoming usable - this is the case with JuiceFS. Of course there are also tiered storage systems like Weka that would have better overall performance, but they are more complicated to set up and more expensive to operate than JuiceFS.
I’ve been using community edition JuiceFS with Percona XtraDB cluster for Metadata and MinIO multi-node for a couple of years for large archival and backup data storage. That setup has worked really well.
2) Never use distributed filesystems unless you need a global namespace.
3) Distributed filesystems mean that you have a single point of failure; it's just a more complex and subtle form of failure.
4) You have to think about how partitioning works, and how you want to recover.
Sharding standard NFS fileservers is by far the easiest way to scale, have redundancy, and partition your failure zones. Using autofs on a custom directory, you can mount servers on demand: /mnt/0, /mnt/1, /mnt/2, etc.
You then have the ability to choose your sharding scheme for speed, redundancy or some other constraint.
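The sharding scheme itself can be as small as a hash of the path modulo the number of mounts. A sketch (the shard count and hash choice are arbitrary; pick whatever matches your layout):

    package main

    import (
        "fmt"
        "hash/fnv"
        "path/filepath"
    )

    const numShards = 8 // /mnt/0 ... /mnt/7

    // shardFor maps a logical path to one of the autofs-mounted NFS servers.
    // Hashing a stable key (e.g. per-project or per-user directory) keeps
    // related files on the same server.
    func shardFor(path string) string {
        h := fnv.New32a()
        h.Write([]byte(path))
        return filepath.Join("/mnt", fmt.Sprint(h.Sum32()%numShards))
    }

    func main() {
        fmt.Println(shardFor("projects/alpha")) // e.g. /mnt/3
    }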
Nowadays the kind of performance that distributed systems used to enable is commodity. A single 4U storage server can saturate a 100-gig network link with random disk IO.
We've used some distributed storage systems (Longhorn, OpenEBS) for Kubernetes and had no luck.
I agree with you on ALL the things you've listed. When we started using Ceph for the first time (at that time I was a user, not a manager), the whole thing was a total mess. Everything was left at defaults, using all the JBODs on each node. (And I started running into corrupted WALs and OSDs out of nowhere.)
The handful of documentation that existed was very helpful for cleaning things up and getting it to run correctly.
Maybe, if I could go back to the starting point, I'd say "move to the cloud" or "do nothing", or I'd follow your guidance.
I should have been more gentle in my statement, for that I apologise.
If you do start again _and_ move to the cloud, AWS's Lustre offering (FSx) is good enough (if expensive), or EFS, although that has slower metadata performance. It's good enough at RW IOPS though, assuming you monitor it correctly.
I feel your pain, and sympathise deeply. Best of luck!
Actually, the average here just implies that, tested across different configurations and scenarios, it comes out to about 100 bytes on average, net, per file. Adding "per file" does have a benefit.
As a computational biologist, I have a lot of embarrassingly parallel problems, so the lightweight concurrency of Go is really easy. In addition, my algorithms and data perform comfortably in GC-land 99% of the time, reducing my mental overhead (for science!). I had to reach for unsafe.Pointer() only once and that was to use Other People's Code (TM).
I consistently find my code runs 20-80x faster than the equivalent in Python and I'm not about to cram C++ or Rust into my head along with all the biology I've memorized for ~2x performance improvement and >5x learning curve. On HN, I shouldn't need to explain why I don't use Java despite a similar set of tradeoffs to Go (other than memory use, historically).
They're only using 36% of that (so they have slack capacity). It's in-memory, which has some perf wins (though I'd question if distributed systems are worth it when you can get a PCIe splitter and just mmap some 4TB SSDs on whatever cheap desktop computer somebody has lying around).
100 bytes per file is less than you would expect to support all that data if you coded it "naturally" without an eye toward space efficiency. You'll need 16-32 bytes or so just to handle a tree structure with internal string indices to be able to support finding any given file's location quickly. Toss in created times, modified times, permissions, attributes, the fact that they have a chunking structure to support both random seek and efficient/easy allocation, .... It adds up to more than 100B if you're not pretty careful. They hit that 100B mark.
Is that impressive? I dunno. Their competitors didn't do it, and they did, so that's something.
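As a rough sanity check on that arithmetic, even a deliberately spartan per-file record lands near that budget before you add the file name, xattrs, or any index structures. A sketch (not JuiceFS's actual layout):

    package main

    import (
        "fmt"
        "unsafe"
    )

    // inode is a deliberately minimal per-file record; real metadata engines
    // add a name, extended attributes, and per-chunk references on top.
    type inode struct {
        id       uint64 // inode number
        parent   uint64 // parent directory
        size     uint64
        atime    int64
        mtime    int64
        ctime    int64
        mode     uint32 // type + permissions
        uid, gid uint32
        nlink    uint32
        chunks   []uint64 // ids of data chunks (slice header alone is 24 bytes)
    }

    func main() {
        fmt.Println(unsafe.Sizeof(inode{})) // ~88 bytes on a 64-bit build: most of the 100B budget
    }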
I'll never understand the "we are open source but lock $cool_but_essential_features behind an enterprise license."
Unless you plan to charge less than what a single developer makes in a quarter or two, any of your potential customers could at any point patch in the feature and open source the changes for everyone.
(Traefik and Riak are the two biggest offenders I've run into)
Those patches are unlikely to be officially supported, though. I really support this mode of monetization, because community/free users get something, well, free, while others get something more for a bit more. It’s [hopefully] sustainable for the engineers and team behind the product, too!
While the underlying idea that selling services on open source products is not sustainable from a purely business perspective may be true, the linked article is not particularly convincing to me: it doesn't provide strong evidence and it's based on an embarrassingly small sample size. Besides, maybe we should also start to consider other metrics when we evaluate a business's success, not just economic profit: there are externalized societal benefits (and damages) that are very important and nonetheless poorly accounted for.
It only looks at one open source company, which could not have closed-sourced its product (because it was not the original developer), so it could only exist at all as an (at least mostly) open source business, and compares it to companies in entirely different lines of business.
It also became badly outdated a few years after it was written, when said business was acquired for several times its market cap at the time the article was written.
If you want to compare it to businesses offering a similar product into a similar market, you should compare it to other OS vendors founded in the 90s targeting desktops and servers. Any more successful proprietary examples around?
Most of all the article only looks at one, open source business. Not exactly a meaningful sample.
It is probably very difficult to rely on that model if a single company is the original developer and the main developer - so you have all the costs but share the benefits. It is not true if you fork code that already exists, or if you have a bazaar development model.
In theory you're right, but in most companies the friction of spinning up a team to manage the custom patch set of each piece of open-core software they use would dominate the actual effort needed to implement the features. It's just not in most companies' DNA to operate this way.
(Also, a lot of SaaS is in fact cheaper than what a dev makes in half a year).
Competitive in this space is also SeaweedFS, which I've taken out for a local spin. It feels kind of like "memcached-for-fs/s3"... you add 30GB chunks of storage and it coalesces them for you without much ceremony. ~20 bytes storage overhead per file and performance seems pretty decent.
High performance systems in any language require these optimisations. Go provides a solid foundation to create the platform before you start optimising. Not saying it's the best or only choice, but it's not a bad choice.
High performance systems are where these kinds of optimisations make the most difference.
If most of your overhead is due to the low performance of the language/runtime you're using, these kinds of optimisations won't make as much of a difference. That is, if you can even implement them at all. I mean, good luck trying to use memory pools and manual memory management in Python or Javascript.
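For what it's worth, Go does give you the hooks for this: sync.Pool plus preallocated buffers gets you most of the "manual" memory management benefit without leaving the language. A minimal sketch:

    package main

    import (
        "fmt"
        "sync"
    )

    // bufPool recycles 64 KiB scratch buffers so hot paths stop allocating
    // (and the GC stops scanning) per request.
    var bufPool = sync.Pool{
        New: func() any { return make([]byte, 64*1024) },
    }

    func handle(payload []byte) int {
        buf := bufPool.Get().([]byte)
        defer bufPool.Put(buf)
        n := copy(buf, payload) // do the real work in the pooled buffer
        return n
    }

    func main() {
        fmt.Println(handle([]byte("hello")))
    }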