
Is there an open source service designed with HDDs in mind that achieves similar performance? I know none of the big ones work that well with HDDs: MinIO, Swift, Ceph+RadosGW, SeaweedFS; they all suggest flash-only deployments.

Recently I've been looking into Garage and liking the idea of it, but it seems to have a very different design (no EC).



I would say that Ceph+RadosGW works well with HDDs, as long as 1) you use SSDs for the index pool, and 2) you are realistic about the number of IOPS you can get out of your pool of HDDs.

And remember that there's a multiplication of IOPS for any individual client IOP, whether you're using triplicate storage or erasure coding. S3 also has IOP multiplication, which they solve with tons of HDDs.

For big object storage that's mostly streaming 4MB chunks, this is no big deal. If you have tons of small random reads and writes across many keys or a single big key, that's when you need to make sure your backing store can keep up.
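As a rough illustration (the numbers below are made up for the example, not from any particular system's docs):

    # Back-of-envelope: backend writes generated by one 4 MiB client write.
    obj_bytes = 4 * 1024 * 1024      # one 4 MiB client write
    replicas = 3                     # triplicate storage
    k, m = 4, 2                      # hypothetical EC profile: 4 data + 2 parity chunks

    print("3x replication:", replicas, "backend writes of", obj_bytes, "bytes each")
    print("4+2 EC        :", k + m, "backend writes of", obj_bytes // k, "bytes each")

Either way, one client write fans out into several backend operations, so the HDD pool sees more IOPS than the clients do.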


Lustre and ZFS can do similar speeds.

However, if you need high IOPS, you need flash on the MDS for Lustre and some SSDs for ZFS (dedicated SLOG for writes and L2ARC for reads).


Thanks, but I forgot to specify that I'm interested in S3-compatible servers only.

Basically, I have a single big server with 80 high-capacity HDDs and 4 high-endurance NVMes, and it's the S3 endpoint that gets a lot of writes.

So yes, for now my best candidate is ZFS + Garage: this way I can get away with using replica=1 and rely on ZFS RAIDz for data safety, and the NVMes can be sliced and diced to act as the fast metadata store for Garage, the "special" device/small-records store for ZFS, the ZIL/SLOG device, and so on.
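Something like this on the Garage side (a minimal sketch from memory; key names vary between Garage versions, e.g. newer releases use replication_factor, and the paths are placeholders):

    # garage.toml sketch (not verified against any specific Garage release)
    metadata_dir = "/nvme/garage/meta"    # NVMe-backed dataset for metadata
    data_dir     = "/tank/garage/data"    # RAIDz HDD pool for object data
    db_engine    = "lmdb"
    replication_mode = "none"             # single copy; durability comes from RAIDz underneath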

Currently it's a bit of a Frankenstein's monster: an old version of MinIO (containerized to run as 5 instances) on top of XFS+OpenCAS as the backing storage. I'm looking to replace it with a simpler design and hopefully get better performance.


It is probably worth noting that most of the listed storage systems (including S3) are designed to scale not only in hard drives, but horizontally across many servers in a distributed system. They really are not optimized for a single-node use case. There are also other things to consider that can limit performance, like what the storage backplane looks like for those 80 HDDs and how much throughput you can effectively push through it. Then there is the network connectivity, which will also be a limiting factor.


It's a very beefy server with 4 NVMe and 20 HDD bays plus a 60-drive external enclosure, and 2 enterprise-grade HBA cards set to multipath round-robin mode; even with 80 drives it's nowhere near the data-path saturation point.

The link is a 10G connection with 9K MTU, and the server is only accessed via that local link.

Essentially, the drives being HDD are the only real bottleneck (besides the obvious single-node scenario).

At the moment, all writes are buffered into the NVMes via OpenCAS write-through cache, so the writes are very snappy and are pretty much ingested at the rate I can throw data at it. But the read/delete operations require at least a metadata read, and due to the very high number of small (most even empty) objects they take a lot more time than I would like.

I'm willing to sacrifice the write-through cache benefits (the write performance is actually overkill for my use case) in order to make it a little more balanced for better List/Read/DeleteObject performance.

On paper, most "real" writes will be sequential data, so writing that directly to the HDDs should be fine, while metadata write operations will be handled exclusively by the flash storage, thus also taking care of the empty/small objects problem.


> Essentially, the drives being HDD are the only real bottleneck

? On the low end a single HDD can deliver 100MB/s, so 80 can deliver 8,000MB/s; a single NVMe can do 700MB/s and you have 4, so 2,800MB/s. A 10Gb link can only do about 1,000MB/s, so isn't your bottleneck the network, and then probably the CPU?
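Roughly, as a sanity check on those numbers:

    # Same rough figures as above, purely illustrative.
    hdd_count, hdd_mb_s   = 80, 100   # ~100 MB/s sequential per HDD, low end
    nvme_count, nvme_mb_s = 4, 700
    link_gbit             = 10

    print("HDD aggregate :", hdd_count * hdd_mb_s, "MB/s")
    print("NVMe aggregate:", nvme_count * nvme_mb_s, "MB/s")
    print("10G link      :", link_gbit * 1000 // 8, "MB/s line rate, ~1000 MB/s in practice")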


If your server is old, the RAID card's PCIe interface will be another bottleneck, alongside the latencies added if the card is not that powerful to begin with.

Same applies to your NVMe throughput, since now you risk congesting the PCIe lanes if you're increasing lane count with PCIe switches.

If there are gateway services or other software-bound processes like RAIDz, your processor will saturate way before your NIC, adding more jitter and inconsistency to your performance.

The NIC is an independent republic on the motherboard: it can accelerate almost anything related to the network stack, esp. server-grade cards. If you can pump the data to the NIC, you can be sure that it will be pushed at line speed.

However, running a NIC at line speed with data read from elsewhere on the system is not always that easy.


Hope you don't have expectations (over the long run) for high availability. At some point that server will come down (planned or unplanned).


For sure, there are zero expectations of any kind of hardware downtime tolerance; it's a secondary backup storage cobbled together from leftovers over many years :)

For software, at least with MinIO it's possible to do rolling updates/restarts since the 5 instances in docker-compose are enough for proper write quorum even with any single instance down.


I'm working on something that might be suited for this use-case at https://github.com/uroni/hs5 (not ready for production yet).

It would still need a resilience/cache layer like ZFS, though.


Ceph's S3 protocol implementation is really good.

Getting Ceph erasure coding set up properly on a big hard disk pool is a pain - you can tell that EC was shoehorned into a system that was totally designed around triple replication.


Could you elaborate on what you mean by the last sentence?


Originally Ceph divided big objects into 4MB chunks, sending each chunk to an OSD server, which replicated it to 2 more servers. 4MB was chosen because it amounts to several drive rotations' worth of data, so the seek + rotational delay didn't affect the throughput very much.

Now the first OSD splits it into k data chunks plus d parity chunks, so the disk write size isn't 4MB, it's 4MB/k, while the efficient write size has gone up 2x? 4x? since the original 4MB decision, as drive transfer rates have increased.

You can change this, but still the tuning is based on the size of the block to be coded, not the size of the chunks to be written to disk. (and you might have multiple pools with much different values of k)
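To put rough numbers on it (using the 4MB figure above and ignoring parity overhead):

    # Per-OSD data chunk size when a 4 MB coded block is split across k data OSDs.
    block = 4 * 1024 * 1024
    for k in (2, 4, 6, 8):
        print(f"k={k}: {block // k // 1024} KiB written per data OSD")
    # Raising the coded block to, say, 24 MiB with k=4 brings it back up to 6 MiB per OSD.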


I'm still not sure which exact Ceph concept you are referring to. There is the "minimum allocation size" [1], but that is currently 4 KB (not MB).

There is also striping [2], which is the equivalent of RAID-10 functionality to split a large file into independent segments that can be written in parallel. Perhaps you are referring to RGW's default stripe size of 4 MB [3]?

If yes, I can understand your point about one 4 MB RADOS object being erasure-coded into e.g. 6 = 4+2 chunks (4 data + 2 parity), making them ~1 MB writes that are not efficient on HDDs.

But would you not simply raise `rgw_obj_stripe_size` to address that, according to the k you choose? E.g. 24 MB? You mention it can be changed, but I don't understand the "but still the tuning is based on the size of the block to be coded" part, (why) is that a problem?

Also, how else would you do it when designing EC writes?

Thanks!

[1]: https://docs.ceph.com/en/squid/rados/configuration/bluestore...

[2]: https://docs.ceph.com/en/squid/architecture/#data-striping

[3]: https://docs.ceph.com/en/squid/radosgw/config-ref/#confval-r...


If you can afford it, mirroring in some form is going to give you way better read perf than RAIDz. Using zfs mirrors is probably easiest but least flexible, zfs copies=2 with all devices as top level vdevs in a single zpool is not very unsafe, and something custom would be a lot of work but could get safety and flexibility if done right.

You're basically seek limited, and a read on a mirror is one seek, whereas a read on a RAIDz is one seek per device in the stripe. (Although if most of your objects are under the chunk size, you end up with something more like mirroring than striping.)

You lose on capacity though.
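Rough numbers, assuming ~100 random-read IOPS per HDD (a placeholder figure) and your 80 drives:

    # Random reads: any mirror side can serve a read; a RAIDz record read touches the whole vdev.
    hdd_iops = 100
    drives = 80
    raidz2_vdevs = 8

    print("2-way mirrors :", drives * hdd_iops, "small-read IOPS (roughly)")
    print("8x RAIDz2(10) :", raidz2_vdevs * hdd_iops, "small-read IOPS (roughly)")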


Yeah, unfortunately mirrors are a no-go due to efficiency requirements, but luckily read performance is not that important if I manage to completely offload FS/S3 metadata and small files to flash storage (separate zpool for Garage metadata, separate special VDEV for ZFS metadata/small files).

I think I'm going to go with 8x RAIDz2 VDEVs of 10 HDDs each, so that the 20 drives in the internal enclosure can be 2 separate VDEVs and not mix with the 60 in the external enclosure.
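Back-of-envelope on the capacity side (drive size below is just a placeholder):

    # Usable capacity, ignoring ZFS overhead and slop; 20 TB drives as a placeholder.
    drives, drive_tb = 80, 20
    raidz2_vdevs, raidz2_width = 8, 10

    mirror_usable = drives * drive_tb // 2                          # 2-way mirrors: 50%
    raidz2_usable = raidz2_vdevs * (raidz2_width - 2) * drive_tb    # 2 parity drives per vdev

    print("2-way mirrors :", mirror_usable, "TB usable")
    print("8x RAIDz2(10) :", raidz2_usable, "TB usable")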


It's great to see other people's working solutions, thanks. Can I ask if you have backups for something like this? In many systems it's possible to store some data on ingress or after processing, which serves as something rebuildable, even if it's not a true backup. I'm not familiar with whether your software layer has off-site backup as part of the system, for example, which would be a great feature.


It might not be the ideal solution, but did you consider installing TrueNAS on that thing?

TrueNAS can handle the OpenZFS part (RAIDz, caches and logs), and you can deploy Garage or any other S3 gateway on top of it.

It can be an interesting experiment, and an 80-disk server is not too big for a TrueNAS installation.


Do you know if some of these systems have components to periodically checksum the data at rest?


ZFS/OpenZFS can scrub and do block-level recovery. I'm not sure about Lustre, but since petabyte-sized storage is its natural habitat, there should be at least one way to handle that.


Any of them will work just as well, but only with many datacenters' worth of drives, which very few deployments can target.

It's the classic horizontal/vertical scaling trade off, that's why flash tends to be more space/cost efficient for speedy access.


SeaweedFS has evolved a lot the last few years, with RDMA support and EC.


At a past job we had an object store that used SwiftStack. We just used SSDs for the metadata storage but all the objects were stored on regular HDDs. It worked well enough.


Apache Ozone has multiple 100+ petabyte clusters in production. The capacity is on HDDs and metadata is on SSDs. Updated docs (staging for new docs): https://kerneltime.github.io/ozone-site/


Doing some light googling, aside from Ceph being listed there's one called Gluster as well. It hypes itself with: "using common off-the-shelf hardware you can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks."

It's open source / free to boot. I have no direct experience with it myself however.

https://www.gluster.org/


Gluster has been slowly declining for a while. It used to be sponsored by Red Hat, but that stopped a few years ago. Since then, development has slowed significantly.

I used to keep a large cluster array with Gluster+ZFS (1.5PB), and I can’t say I was ever really that impressed with the performance. That said — I really didn’t have enough horizontal scaling to make it worthwhile from a performance aspect. For us, it was mainly used to make a union file system.

But, I can’t say I’d recommend it for anything new.


A decade ago where I worked we used Gluster for ~200TB of HDD as a shared file system on a SLURM compute cluster, as a much better clustered version of NFS. And we used Ceph for its S3 interface (RadosGW) for tens of petabytes of back-end storage after the high-IO stages of compute were finished. The Ceph was all HDD, though later we added some SSDs for a caching pool.

For single-client performance, Ceph beat the performance I get from S3 today for large file copies. Gluster had difficult-to-characterize performance, but our setup with big fast RAID arrays seems to still outperform what I see of AWS's Lustre-as-a-service today for our use case of long sequential reads and writes.

We would occasionally try CephFS, the POSIX shared network filesystem, but it couldn't match our Gluster performance for our workload. But also, we built the Ceph long-term storage to maximize TB/$, so it was at a disadvantage compared to our Gluster install. Still, I never heard of CephFS being used anywhere, despite it being the original goal in the papers back at UCSC. Keep an eye on CERN for news about one of the bigger Ceph installs with public info.

I love both of these systems, and see Ceph used everywhere today, but am surprised and happy to see that Gluster is still around.


I've used GlusterFS before because I had tens of old PCs, and it worked very well for me. It was basically a PoC to see how it works rather than production, though.


We've been running a production Ceph cluster for 11 years now, with only one full scheduled downtime for a major upgrade in all those years, across three different hardware generations. I wouldn't call it easy, but I also wouldn't call it hard.

I used to run it with SSDs for the RadosGW indexes as well as a fast pool for some VMs, and hard drives for bulk object storage. Since I was only running 5 nodes with 10 drives each, I got tired of occasional IOPS issues under heavy recovery, so on the last upgrade I just migrated to 100% NVMe drives. To mitigate the price I just bought used enterprise Micron drives off eBay whenever I saw a good deal pop up. Haven't had any performance issues since then, no matter what we've tossed at it.

I'd recommend it, though I don't have experience with the other options. On paper I think it's still the best option. Stay away from CephFS though; performance is truly atrocious and you'll footgun yourself with any use in production.


We've been using CephFS for a couple of years, with some PBs of data on it (HDDs).

What performance issues and footguns do you have in mind?

I also like that CephFS has a performance benefit that doesn't seem to exist anywhere else: automatic transparent Linux buffer caching, so that writes are extremely fast and local until you fsync() or other clients want to read, and repeat reads or read-after-write are served from local RAM.


>Recently I've been looking into Garage and liking the idea of it, but it seems to have a very different design (no EC).

What do you mean by no EC?


In their design document at https://garagehq.deuxfleurs.fr/documentation/design/goals/ they state: "erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication"


Nice! Learned something new today. Seems like a way to do error correction: one can store parts of the data plus some extra redundancy, and if some parts are lost, the original can be reconstructed with some computational power.

Seems like some kind of compression?

Is that how the error correction on DVDs works?

And is that how GridFS can keep file storage costs so low compared to a regular file system?
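To make the reconstruction idea concrete, here is a toy single-parity sketch (real systems use Reed-Solomon codes, which tolerate multiple lost chunks; and it's added redundancy rather than compression):

    # Toy single-parity erasure code: lose any one chunk, rebuild it by XOR.
    import os

    d1, d2, d3 = (os.urandom(8) for _ in range(3))              # three data chunks
    parity = bytes(a ^ b ^ c for a, b, c in zip(d1, d2, d3))    # one parity chunk

    # Pretend d2 was lost; rebuild it from the survivors plus the parity chunk.
    rebuilt = bytes(a ^ b ^ c for a, b, c in zip(d1, d3, parity))
    assert rebuilt == d2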



