[dupe] The Sorry State of Copy-On-Write File Systems (louwrentius.com)
37 points by pjotrligthart on Aug 31, 2015 | 37 comments




Someone should write an article on the sorry state of file systems in general. ZFS and btrfs are improvements, though still not quite there yet. But the distance between user and storage still seems vast and primitive. Perhaps Seagate's Kinetic drives are the future we need? (http://www.seagate.com/tech-insights/kinetic-vision-how-seag...)


Translation: "ZFS and btrfs are 'incomplete' because their handling of striped RAID is incomplete."

Disregarding the fact that things like mdadm still exist, and further disregarding the fact that the vast majority of filesystems out there don't bother with trying to implement RAID at all (probably because - again - there are things like mdadm that do that already), RAID5/6 are generally a bad idea compared to a RAID1 or RAID10. Both RAID5 and RAID6 make multiple very incorrect assumptions:

* Failure of multiple drives in a short period of time is rare (in reality, if one drive fails (excluding bathtub-curve-related infant mortality), the likelihood of subsequent drive failures increases significantly)

* The failed drive can be replaced and repopulated quickly (in reality, this is becoming less and less true as drives get bigger and bigger, thus taking more and more time to rebuild the failed array member; SSDs buy some time here, but that's not sustainable)

* Bit rot / "cosmic rays" are rare (in reality, silent data errors happen all the time, as has been demonstrated [0], which is one example of why RAID5/6 is woefully insufficient for even the most basic protection against data corruption)

Basically, if you want something striping, and care at all about data integrity, go with RAID10. Only use RAID5/6 if you don't care about data loss (whether due to a comprehensive backup policy or a comprehensive redundancy policy on a machine-level), though in that case you might as well just cut to the chase and use RAID0.
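To put a rough number on the rebuild-window and bit-rot points above, here's a minimal back-of-the-envelope sketch in Python. The 10^-14 unrecoverable-read-error (URE) rate is the figure typically quoted on consumer spec sheets, and the 4-disk/4TB array is a made-up example, so treat the output as an order-of-magnitude estimate, not a measurement:

    # Rough odds of hitting an unrecoverable read error (URE) while rebuilding
    # a degraded RAID5: every surviving member has to be read end to end to
    # reconstruct the failed disk.
    URE_RATE = 1e-14      # assumed: ~1 error per 10^14 bits read (consumer spec-sheet figure)
    DISK_TB = 4           # hypothetical 4 TB members
    SURVIVORS = 3         # 4-disk RAID5 with one member already failed

    bits_read = SURVIVORS * DISK_TB * 1e12 * 8
    p_clean = (1 - URE_RATE) ** bits_read
    print(f"bits read during rebuild: {bits_read:.2e}")
    print(f"chance of at least one URE: {1 - p_clean:.0%}")

With those assumptions the rebuild has roughly a 60% chance of tripping a URE before it finishes, which is exactly the window the parity math doesn't account for.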

I really wish this article would've dug into some of the real shortcomings of these filesystems; btrfs in particular would be incredibly useful compared to a more traditional LVM approach if it supported file encryption and swap subvolumes (for either of these things, LVM (+ LUKS for encryption) is necessary with or without btrfs). Instead they get criticized over not supporting things that a sane sysadmin wouldn't touch with a ten-foot pole in this day and age.

[0]: http://www.miracleas.com/BAARF/Why_RAID5_is_bad_news.pdf


>Basically, if you want something striping, and care at all about data integrity, go with RAID10. Only use RAID5/6 if you don't care about data loss (whether due to a comprehensive backup policy or a comprehensive redundancy policy on a machine-level), though in that case you might as well just cut to the chase and use RAID0.

Your points are valid but this doesn't follow. RAID10 is strictly worse than RAID6 for redundancy. In a four-drive array, any two drives can fail with RAID6, whereas with RAID10 two drives may be enough to blow up half your data. RAID10 actually trades redundancy for performance compared to RAID6.


I should've clarified that RAID1 is the absolute best if data integrity is your sole concern. Regardless...

> RAID10 is strictly worse than RAID6 for redundancy.

Only if you use two drives per mirror; RAID10's worst-case matches RAID6's when using three disks per mirror.

RAID6 also lacks a "best case" that's distinct from its worst case. You lose three disks, you're hosed, period. You lose three disks in a RAID10 (even one with two disks per mirror), it's much more probable that your data will be intact. You can mitigate this further by using a bigger array (RAID10 total failure probability is affected by array size, unlike RAID6) or by using different vendors for each disk in the mirrors (example: each mirror has one Intel and one Samsung SSD of the same capacity) - which, while having some performance and capacity implications, actually mitigates the "oh no I bought a bad batch of hard drives and now my RAID is kaput" failure case (unlike with RAID6, where mixing/matching vendors won't help you unless you mix/match them for every disk).
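To make the best-case/worst-case point concrete, here's a small enumeration sketch. It uses the abstract survival rules only (RAID6 survives any two losses; RAID10 survives as long as no mirror group loses all of its members), not any particular implementation's layout, and the six-disk sizes are just illustrative:

    # Count how many combinations of k simultaneous disk failures each layout survives.
    from itertools import combinations

    def raid10_survives(failed, mirror_groups):
        # Data is lost only if some mirror group loses every member.
        return all(not group <= failed for group in mirror_groups)

    def count_survived(n_disks, k, survives):
        combos = [set(c) for c in combinations(range(n_disks), k)]
        return sum(survives(c) for c in combos), len(combos)

    mirrors_2way = [set(range(i, i + 2)) for i in range(0, 6, 2)]  # 3 x 2-way mirrors
    mirrors_3way = [set(range(i, i + 3)) for i in range(0, 6, 3)]  # 2 x 3-way mirrors

    for k in (2, 3):
        raid6 = "all" if k <= 2 else "none"
        ok2, total = count_survived(6, k, lambda f: raid10_survives(f, mirrors_2way))
        ok3, _ = count_survived(6, k, lambda f: raid10_survives(f, mirrors_3way))
        print(f"{k} failures: RAID6 survives {raid6}, "
              f"RAID10 (2-way) {ok2}/{total}, RAID10 (3-way) {ok3}/{total}")

For three failures the output is 8/20 for two-way mirrors and 18/20 for three-way mirrors, versus zero for RAID6 -- that's the "best case" being talked about here.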

There are also some recoverability benefits of RAID10 v. RAID6; notably, recovering from a fault in RAID6 requires recomputing parity, while no such thing is necessary for RAID10. This mitigates the "oh no my array took too long to fix itself and now my RAID is kaput" failure case.


>Only if you use two drives per mirror; RAID10's worst-case matches RAID6's when using three disks per mirror.

That's not a proper comparison as you are now using more disks in the RAID10 array than the RAID6 to get the same space. If you're willing to have 6 disks to get the space of 2 you could use an array with 3 parity disks (what ZFS calls Z3) and now you can fail any 3 disks just like RAID10 and you've only used 5 disks instead of 6. If you have the option for 4 parity disks (I don't think ZFS provides a Z4 level) you could fail any 4 drives of 6 and get the space of 2, so 1 better than RAID10. N-Parity setups are strictly more redundant than RAID10 for the same number of disks and required capacity.

>actually mitigates the "oh no I bought a bad batch of hard drives and now my RAID is kaput" failure case (unlike with RAID6, where mixing/matching vendors won't help you unless you mix/match them for every disk).

This makes no sense to me. I mix and match vendors in RAID6 arrays just fine and it lowers the correlation of failures between disks just as well. In fact parity RAID is again strictly better at this. Since the redundancy is spread across the whole array instead of split into clusters, whenever you add more drives from different manufacturers you're adding that redundancy to the whole array. With RAID10 if you have 3 drives per mirror your maximum redundancy before losing data is 3 manufacturers.

>There are also some recoverability benefits of RAID10 v. RAID6; notably, recovering from a fault in RAID6 requires recomputing parity, while no such thing is necessary for RAID10. This mitigates the "oh no my array took too long to fix itself and now my RAID is kaput" failure case.

Yeah, performance is the real advantage of RAID10 as I mentioned and indeed that performance can be important during rebuilds so it has some durability implications.


> That's not a proper comparison as you are now using more disks in the RAID10 array than the RAID6 to get the same space.

You already have to use more disks for a RAID10 than in a RAID6, so I don't see why this is that big of a deal. Anything involving mirroring will involve at least a 50% cut in capacity (or more, such as with a 3-way or even 4-way mirror). RAID1 and RAID10 don't prioritize capacity; they prioritize robustness.

> N-Parity setups are strictly more redundant than RAID10 for the same number of disks and required capacity.

But not more robust against failure, as I've already described (hopefully) rather clearly. If we want to play this game, I can keep adding more disks per mirror, and now my worst-case will match that of your n-parity array while having a vastly better best-case (and, additionally, dragging the probability of that worst-case happening closer and closer to zero). Remember: if you're a RAID1 or RAID10 user who cares about disk capacity, you're doing something wrong.

> I mix and match vendors in RAID6 arrays just fine and it lowers the correlation of failures between disks just as well.

Do you mix and match all disks, though? Because if you have at least three (in the case of RAID6) disks, no matter where they are in the array, that have the same failure curve (i.e. are the same model/manufacturer), then you're gonna have a bad time when the first one of them dies.

That's my point. In order to get the same inherent benefit as RAID10 here, you'd have to have a different vendor/model for every single drive in your RAID6 array.

> whenever you add more drives from different manufacturers you're adding that redundancy to the whole array.

Only if they're all different. You'll eventually get to the point where there aren't enough manufacturers in the world (or you'll be compromising on manufacturer diversity - which is, granted, what most people do).

> With RAID10 if you have 3 drives per mirror your maximum redundancy before losing data is 3 manufacturers.

And that's fine, because if one of those manufacturers sold me a defective batch, the problem's isolated to that side of the mirror. That's my point; with RAID10, it's no longer a game of sheer numbers of different models and manufacturers, but instead the more manageable strategy of giving each side of the mirror a different bathtub curve to ride, thus mitigating the chance of multiple hard drive batches failing at the same time.


> You already have to use more disks for a RAID10 than in a RAID6, so I don't see why this is that big of a deal.

Not really. RAID6 and RAID10 are only directly comparable at 4 disks as they have the same capacity with the same number of disks. And in that case RAID6 is strictly better than RAID10 (any 2 disks can fail vs at best 2 disks can fail).

>But not more robust against failure, as I've already described (hopefully) rather clearly. If we want to play this game, I can keep adding more disks per mirror, and now my worst-case will match that of your n-parity array while having a vastly better best-case

This is simply not true. An n-parity array will always be strictly (and I mean strictly) more redundant than the equivalent RAID array. For 6 disks this means using 4 parity disks, which allows failing any 4 disks: RAID10's best case, and better than its worst case (any 2 disks can fail). If you go for 8 disks, RAID10 allows a best case where 6 disks can fail, which with a 6-parity RAID is your normal case, and so on. Parity is strictly better than mirroring at redundancy; it's just usually not worth it for performance reasons. At the file level it may make sense, though (Backblaze does file-level 3-parity, for example), as you can choose to take the hit only on certain files. Btrfs also allows per-file RAID levels, so it's too bad they're not pursuing that external patch for n-parity RAID levels that the original article mentions.

(skipped some parts as the core of the issue is below)

>And that's fine, because if one of those manufacturers sold me a defective batch, the problem's isolated to that side of the mirror. That's my point; with RAID10, it's no longer a game of sheer numbers of different models and manufacturers, but instead the more manageable strategy of giving each side of the mirror a different bathtub curve to ride, thus mitigating the chance of multiple hard drive batches failing at the same time.

This is not a RAID10 advantage; it's a way to mitigate the RAID10 disadvantage compared to parity. Parity RAID with the same capacity/number of disks can survive more disk failures than RAID10 (as I explained above). In both cases you want the pool of N disks to be as diversified as possible to avoid correlation between failures. Parity exploits that lack of correlation completely (any disk failure is just like any other, it doesn't matter which disk is which), whereas RAID10 will blow up earlier if all disks in one mirror fail at once, so you mitigate that by making the two sides different.


> RAID6 and RAID10 are only directly comparable at 4 disks

They're really not ever directly comparable, because RAID10 isn't a hard-and-fast RAID level like RAID6, but rather the combination of two (really, one plus a quasi-level, but whatever). RAID10 encapsulates any situation where mirrors are striped together.

> For 6 disks this means using 4 parity disks which allows failing any 4 disks which is RAID10's best case but better than it's worst case (any 2 disks can fail).

You're getting caught up on redundancy for a given total number of disks, and in the process missing the point entirely. If we really want to play that game, then I'll just roll a RAID1 (a.k.a. a RAID10 with only one set of mirrors) and call it a day, because with an n-disk array, a RAID1 will always be at least as redundant as the equivalent parity-based RAID (really, it'll always be able to survive one more disk failure than an n-1 parity-block array; you can see this rather easily when comparing RAID5 with a three-way RAID1).

From there, my point should be more clear. Just like how you can constantly add more and more parity blocks (and disks to handle them), I can constantly widen the mirror. Eventually we'll both get to the point where we realize "what the hell are we doing; there's 100 copies of every block; we should start adding some more non-redundancy disks", at which point RAID10's advantages become more clear, since each mirror group added to the striped whole instantly bumps up capacity and the best-case without compromising the worst-case. Meanwhile, you do have the advantage of having more flexibility in the quantity of disks added, but doing so doesn't help your best-case unless you feel like resuming the whole "let's make every disk a parity or mirror disk" arms race.

And before you mention it: yes, I know RAID1 isn't traditionally counted under RAID10, but a RAID10 with only one mirror group is possible to configure using tools like `mdadm`, and while this wouldn't look different from a RAID1 at first glance, it would certainly be different as soon as more mirror groups are added.

> Parity is strictly better than mirroring at redundancy

I think this remark is the core of your misunderstanding here. Are you really trying to claim that an n-sized parity array with n-1 parity blocks is better than a RAID1? You might want to clarify your position there if that's not what you meant to suggest :)

> Parity exploits that lack of correlation completely (any disk failure is just like any other, it doesn't matter which disk is which)

Which is exactly why you're incorrect. Since any disk failure is like any other, every single disk has to be unique in order for a parity-based array to benefit from any diversity. With mirroring, you don't have to go nearly as far for that same benefit; you just need each mirror to have a different bathtub curve. That's my point.

--

This has been a fun discussion, and there is certainly lots of room for debate on the merits of mirroring v. parity, but I'm starting to think that we should just agree to disagree on this one.


>You're getting caught up on redundancy for a given total number of disks, and in the process missing the point entirely.

That is the only point. Let me state it like this: you have N disks out of which you want to get M disks' worth of capacity (M < N); what's the RAID configuration that maximizes redundancy? My assertion is that parity RAID is strictly better than RAID10 at doing that, and you haven't really refuted it.

>From there, my point should be more clear. Just like how you can constantly add more and more parity blocks (and disks to handle them), I can constantly widen the mirror.

Of course you can, and for any given number of disks you'll always be behind on redundancy. You can of course just throw more hardware at the problem, and if your choices are between RAID6 (2-parity) and RAID10, then RAID10 eventually wins, because you need more parity disks than RAID6 provides to take advantage of the extra hardware. Since most block RAID implementations only do 2- or at most 3-parity, RAID10 ends up being the only practical solution for lots of disks. But for a 4- or 5-disk setup (most NAS applications, for example) RAID6 or ZFS's 3-parity support is a better choice if the performance tradeoff is workable (as it is in most NAS use).
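To spell out the worst-case arithmetic behind this, here's a tiny sketch. It assumes RAID10 is built as M stripe groups of equal-width (N/M-way) mirrors, and the (N, M) pairs are just examples:

    # Guaranteed (worst-case) failure tolerance for N disks giving M disks of usable space:
    #   n-parity array: survives any N - M failures
    #   RAID10        : M groups of N/M-way mirrors -> survives any N/M - 1 failures
    def parity_tolerance(n, m):
        return n - m

    def raid10_tolerance(n, m):
        assert n % m == 0, "assumes equal-width mirror groups"
        return n // m - 1

    for n, m in [(4, 2), (6, 2), (6, 3), (8, 2)]:
        print(f"N={n}, M={m}: parity tolerates {parity_tolerance(n, m)}, "
              f"RAID10 tolerates {raid10_tolerance(n, m)} (worst case)")

For every pair the parity number is at least as large; that's the "strictly more redundant for the same disks and capacity" claim in worst-case terms, while the earlier best-case argument is about which failure combinations RAID10 can survive beyond that guarantee.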

>because with an n-disk array, a RAID1 will always be at least as redundant as the equivalent parity-based RAID

This is a red herring. Yes, there is the trivial lower bound that no configuration of N disks can survive the loss of all N disks, and if you want the space of only a single disk, a RAID1 of N disks is the obvious choice.

>(really, it'll always be able to survive one more disk failure than an n-1 parity-block array; you can see this rather easily when comparing RAID5 with a three-way RAID1).

This is again wrong. A 3 disk RAID1 has the size of 1 disk and a 3 disk RAID5 has the size of 2 disks so they're not the same. To make it comparable you need to make the array RAID6 at which point you have the same space and redundancy. You'd still do a RAID1 though as for the 1 disk size case there is no added benefit of parity.

>RAID10's advantages become more clear, since each mirror group added to the striped whole instantly bumps up capacity and the best-case without compromising the worst-case. Meanwhile, you do have the advantage of having more flexibility in the quantity of disks added, but doing so doesn't help your best-case unless you feel like resuming the whole "let's make every disk a parity or mirror disk" arms race.

This is missing the point entirely. The parity normal case is the same as the RAID10 best case and better than the worst case. There's no real way around that.

>Which is exactly why you're incorrect. Since any disk failure is like any other, every single disk has to be unique in order for a parity-based array to benefit from any diversity

This is just not true. Having diversity means that failures become uncorrelated. Parity RAID exploits that even better than RAID10. The property you keep bringing up basically says "since RAID10 has worse redundancy than n-parity, it benefits more from uncorrelated disk failures", which is true, but only by partially mitigating a problem that parity RAID doesn't have.

>This has been a fun discussion, and there is certainly lots of room for debate on the merits of mirroring v. parity, but I'm starting to think that we should just agree to disagree on this one.

I think that by now I've explained my point clearly enough and even repeated myself a bit so yeah, I'll drop it here.


Sorry, I promised I'd let us agree to disagree, and I will after this, but I just can't help myself on this one point:

> This is again wrong. A 3 disk RAID1 has the size of 1 disk and a 3 disk RAID5 has the size of 2 disks so they're not the same. To make it comparable you need to make the array RAID6 at which point you have the same space and redundancy. You'd still do a RAID1 though as for the 1 disk size case there is no added benefit of parity.

Pardon, but what?

Think about this for a second.

For one, now you're moving the goalpost by shifting gears to disk capacity alone (which, as I've mentioned repeatedly, is a non-factor if you're considering anything mirror-based). With the same number of disks, mirroring always beats parity in a RAID. There's no getting around that.

Don't believe me? Take those four disks (you need four at minimum) in your RAID6. Your array can survive two disk failures. Now I build a RAID1 with four disks. Mine can survive three disk failures. Yes, there's a capacity hit, but it's been repeatedly established that it's not capacity that matters.

This still holds true when comparing a three-disk RAID5 and a three-disk RAID1. This still holds true when comparing any n-disk n-1-parity-block array with any n-disk RAID1 with the same value of n.

Your points are much more valid when they're targeted at RAID10 instead of the concept of mirroring in general.


I've explained all this so there's nothing new here at all. I'll try again if it helps.

>Don't believe me? Take those four disks (you need four at minimum) in your RAID6. Your array can survive two disk failures. Now I build a RAID1 with four disks. Mine can survive three disk failures.

Apples and oranges again...

> Yes, there's a capacity hit, but it's been repeatedly established that it's not capacity that matters.

This is not established. It's something you keep saying, but it makes no sense. My NAS has four 2TB disks; I've configured it as RAID6 for 4TB of space where any two disks can fail. RAID10 is strictly inferior to that, and RAID1 would only give me 2TB of space, which is not enough for me. This result is generalizable: for a given number of disks and a given target usable array size, the n-parity RAID configuration always has superior redundancy to the RAID10 configuration; there's no way around that.

>Your points are much more valid when they're targeted at RAID10 instead of the concept of mirroring in general.

My points have always been about RAID10 and not RAID1. If RAID1 is enough that's always the better option as obviously it's not possible to improve on "you have N disks and N-1 of them can fail".

Note that's also the reason RAID5 implementations are limited to a 3-disk minimum and RAID6 to a 4-disk minimum. In theory you could configure a RAID5 with 2 disks (1 data + 1 parity) and a RAID6 with 3 (1 data + 2 parity), but since it doesn't make sense to calculate parity of a single disk, that just ends up meaning RAID1.


Not to mention the RAID5 "write hole" and the small-write penalty, where a single write turns into a read-modify-write cycle (read the old data and parity, then write both back). RAID0+1 provides better write speeds because less computation and fewer I/Os are needed.

Striped RAIDs are a thing of the past. The closest we could come today is, effectively, "spanning datacenters".


Worth mentioning that RAID1+0 and RAID0+1 are two different things (striped mirrors versus mirrored stripes, respectively), and have two different probabilities of multi-drive failure causing total array failure (long story short, RAID10 has a better chance of surviving a multi-disk failure than RAID01 without any sacrifice in capacity or performance).
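A quick enumeration of that difference for the smallest case -- four disks split two ways, using the textbook layouts rather than any specific controller's behaviour:

    # RAID10 = stripe of mirrors, RAID01 = mirror of stripes; count which
    # two-disk failures each one survives.
    from itertools import combinations

    mirrors = [{0, 1}, {2, 3}]   # RAID10: lost only if a whole mirror pair dies
    stripes = [{0, 1}, {2, 3}]   # RAID01: lost once both stripe sides are degraded

    def raid10_ok(failed):
        return all(not m <= failed for m in mirrors)

    def raid01_ok(failed):
        return any(s.isdisjoint(failed) for s in stripes)

    pairs = [set(c) for c in combinations(range(4), 2)]
    print("RAID10 survives", sum(map(raid10_ok, pairs)), "of", len(pairs))
    print("RAID01 survives", sum(map(raid01_ok, pairs)), "of", len(pairs))

RAID10 survives 4 of the 6 possible two-disk failures, RAID01 only 2, with identical capacity in both layouts.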

But yes, the performance boost is certainly quite fantastic.


People who use filesystems like this seem to completely misunderstand what RAID is for. They seem to care about their data somewhat, which of course means they have off-site backups.

Given that they have reliable backups, redundancy is only useful to the extent that it improves uptime. Uptime is best maintained by eliminating single points of failure. RAID is a good first step, but at some point, creating a massive disk array on a single motherboard/CPU/disk controller is anything but that. And they don't seem to be complaining of downtime.

Given that they have already achieved data safety, and don't care about uptime, the only reasonable explanation I can surmise is that they're confused.


I tend to agree. RAID 1 is fine for your desktop where you don't want to waste time restoring your backup when a drive inevitably dies. When the kernel gets confused and says both drives are corrupted simultaneously, it's easy to manually repair with dd. (See which block has unrecoverable errors and copy that one over from your other drive. Repaired. Not sure why the kernel can't do this automatically. The data is gone, you can't make it worse.)
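For what it's worth, here's a purely illustrative Python sketch of that manual repair (the dd equivalent). The device paths, offset, and block size are hypothetical -- in practice you'd take them from the kernel's unrecoverable-read-error messages -- and the array has to be stopped before you touch its members directly:

    # Copy one readable block from the healthy mirror member over the
    # unreadable block on the other member (what you'd otherwise do with dd).
    GOOD = "/dev/sdb1"         # hypothetical healthy member
    BAD = "/dev/sda1"          # hypothetical member with the pending/unreadable sector
    OFFSET = 123456 * 4096     # hypothetical byte offset of the bad block
    BLOCK = 4096

    with open(GOOD, "rb") as src:
        src.seek(OFFSET)
        data = src.read(BLOCK)

    with open(BAD, "r+b") as dst:
        dst.seek(OFFSET)
        dst.write(data)        # rewriting usually makes the drive remap the bad sector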

For datacenter applications, having everything on one machine behind a filesystem abstraction seems like unnecessary complexity, not to mention a single point of failure and a waste of resources. I'd much rather handle things on the application level; if I want three copies of my data around, write it to three different machines with disks. When a copy seems unavailable, make an extra to keep the redundancy level the same across disk failures. If I want some redundancy encoding, just store that instead of the raw file. Now there is no complicated recovery when your disk fails. The system is always in a recovered state because disk failures are expected and uneventful.
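A minimal sketch of that write path, assuming a put_block(host, key, data) transport you'd supply yourself; the hostnames and replication factor here are made up:

    # Write every object to R machines up front and only report success once
    # enough replicas confirm a matching checksum.
    import hashlib

    REPLICAS = ["storage-a", "storage-b", "storage-c"]   # hypothetical hosts
    REQUIRED_ACKS = 3

    def put_block(host, key, data):
        """Stand-in: ship `data` to `host` and return the checksum it computed."""
        raise NotImplementedError

    def replicated_write(key, data):
        digest = hashlib.sha256(data).hexdigest()
        acks = 0
        for host in REPLICAS:
            try:
                if put_block(host, key, data) == digest:
                    acks += 1
            except OSError:
                pass   # an unreachable replica is expected, not fatal
        if acks < REQUIRED_ACKS:
            raise IOError(f"only {acks}/{REQUIRED_ACKS} replicas confirmed {key}")
        return digest

A background scrubber that re-replicates any key whose copy count drops is what gives you the "always in a recovered state" property described above.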

That said, there is a bit of complexity in implementing this, and even more complexity when you want all the pleasantries of a UNIX filesystem (permissions, quota, etc.). I know of only one working implementation of this technique, and it's not open source. And, of course, if you're going to use existing software rather than writing your own, you're going to need a UNIX filesystem for it to use, not some clever chunk storage manager you wrote. So btrfs and zfs persist.


In your opinion, what is the best way to address the use case of spreading I/O across multiple spindles for non-uptime-related performance reasons?


Who even cares about spindles anymore?

Need fast? SSD on the PCI bus. Need lots of storage? Spinning disk where you care not about performance.


> Who even cares about spindles anymore?

Folks who have a shitload of data to store, but want more perf than a single drive can give them?


Using any of the btrfs supported RAID levels will improve read performance, and anything but level 1 will improve write performance.

I like RAID10.


There are lots of areas in which the data is reproducible but local continuity (and checksumming!) are still extremely valuable. Think things like large ISO files, DVD rips, or even intermediate stages of data analysis. Since I can reproduce them, having a backup might not make sense, but I still don't really /want/ to have to go to all that effort.

Especially for things like scratch space for analysis, you really don't want bit flips, since a flip can silently invalidate later stages.


Did you read the article? Did you read my post? Who are you replying to?

Checksumming is great. Improving uptime (in your case, by not having to recreate the data) is great. Why do you assume I don't agree with you?


"They seem to care about their data somewhat, which of course means they have off-site backups."

I don't think a reply of "there are classes of data that one can care about and still not have backups of" is unreasonable.

"...I can surmise is that they're confused"

Again, I think showing a few cases in which users can have perfectly non-confused reasons for using these filesystems is pretty straightforward as conversational gambits go.


The article wasn't about using RAID, it was complaining of a lack of RAID60 and triple parity in btrfs, and a lack of volume expansion features in zfs.


Well, yes but I don't think either of those things are really addressed by "care about data IFF backups exist". I guess I either don't get what point you're making or we're just getting distracted by phrasing.


Does the off-site backup also contain the write that happened 1.5 seconds ago? What about the corruption that happened just before taking the backup?


This is a ridiculous scenario. If you have some files on your laptop that are super important, you probably shouldn't `rm -rf` them 1.5 seconds after uploading to your NAS. If you were working on that file, you shouldn't wait to save it until you are finished, or should be prepared to redo that work. Save it as you go.

Anyway, if a drive fails 1.5 seconds after a write, any of the btrfs-supported RAID levels (other than 0) will save you. If corruption happens just before a backup, no RAID level will save you, because RAID does not protect against undetected silent corruption. Btrfs may save you there, because it does compute checksums -- allowing you to detect the corruption and recover.

If you really are going to immediately purge some critical data from its source location 1.5 seconds after fsyncing to your drive, you should be sending it to multiple locations and waiting for confirmation containing a valid checksum before considering the write a success.


> This is a ridiculous scenario. If you have some files on your laptop that are super important, you probably shouldn't `rm -rf` them 1.5 seconds after uploading to your NAS.

There is nothing ridiculous about a hard drive breaking between backups. There are situations where losing all data since the latest backup would be, if not unrecoverable, at least extremely irritating. Having redundant drives in a laptop might not be feasible, but dedicated workstations are still used in many places.

> If you really are going to immediately purge some critical data from its source location 1.5 seconds after fsyncing to your drive, you should be sending it to multiple locations and waiting for confirmation containing a valid checksum before considering the write a success.

Which is exactly what ZFS/btrfs in a mirrored RAID configuration does.


Wonder if anyone has ever thought about building a SQL server directly in a hard drive?


Because what everyone wants is more invisible closed-source software running on underpowered CPUs? Not to mention the limitations on performance, capacity, and reliability scaling that would be inherent in binding the database instance to a single disk device. Or were you suggesting that each of these tiny controllers running deeply proprietary software should also form a distributed database in concert with host software or HBA controller firmware?

To say that this is not a good idea would be putting it mildly. The existing efforts to build more functionality into disk drives, starting with FDE and now with some object-store-like interfaces, sound really appealing at first. Less software to write, functionality everywhere, disks seem to Just Work today, so wouldn't it be nice if they could Just Work in some more ways too. However, as soon as you start thinking about building something larger or more interesting than a toy, it becomes apparent that the disk drive is the wrong place for this. The knowledge of the problem is not present at that level of the system, and the interfaces and processing power are inadequate to express it. The problems of error recovery, scaling, and debuggability cannot be solved there. It might be useful for a few of the very smallest consumer-grade applications where none of these concerns are significant (nor likely to be solved by a higher-level vendor anyway), but it's not generally viable.


I think a more interesting question would be whether anyone has built a practical userland on top of SQL (or any reasonably rich database, for that matter).

Here is some discussion about SQL on raw devices: http://dba.stackexchange.com/questions/80036/is-there-a-way-...


The Newton skipped SQL but had an object database with queries.


There are a few drives out there which have a key/value interface as opposed to a block interface. Seagate's Kinetic drives come to mind:

http://www.seagate.com/tech-insights/kinetic-vision-how-seag...


One of the big log search startups was doing postgres query engine integration on drive controllers at least 10 years ago. (sorry, don't recall which)

And as far back as 1964 IBM was doing K-V in hardware on the drive in their count-key-data devices. [1]

[1] https://en.wikipedia.org/wiki/Count_key_data


https://en.wikipedia.org/wiki/WinFS

edit: apologies, you meant on the controller presumably


Isn't COW a fundamentally hard problem? How does one expect a complete, comprehensive solution?


s/tripple/triple/g





