Merging bcachefs (lwn.net)
257 points by rascul on June 17, 2023 | 83 comments


> Brauner said that he thinks bcachefs is in "excellent shape to be upstreamed", but he is concerned with the number of filesystems in the kernel; he is glad to see that there are efforts to remove some of them. Changes that impact all of the filesystems in the tree "get painful very very fast" and, in some cases, there is no one available to review the changes. He would like the acceptance process to be more conservative; accepting NTFS/NTFS3 was "a huge mistake", for example.

As someone not familiar with the filesystem related parts of the kernel, it's quite surprising to hear this. It sounds like filesystems are a lot more integrated (or at least less modular) than other kernel subsystems.

For anyone else who found this surprising, I recommend reading this other LWN article, which is referenced as well: https://lwn.net/Articles/886708/

However, that mostly discusses issues caused by 32-bit data structures, which will overflow in 2038 and no longer be able to represent the required timestamps. Specifically, this affects ext3, NTFS, and ReiserFS.

But other than that issue, I don't really understand why it's difficult to simply rip out support for a specific filesystem. Compared to something like a driver for a PCIe or USB device, what makes filesystems so much more integrated and difficult to remove?


> As someone not familiar with the filesystem related parts of the kernel, it's quite surprising to hear this. It sounds like filesystems are a lot more integrated (or at least less modular) than other kernel subsystems.

AFAIK, on Linux filesystems are closely coupled with the memory management subsystem and the directory and inode caches. It's part of the reason why filesystem access on Linux is so fast.

> But other than that issue, I don't really understand why it's difficult to simply rip out support for a specific filesystem. Compared to something like a driver for a PCIe or USB device, what makes filesystems so much more integrated and difficult to remove?

It's not that unusual for users to have a partition containing a filesystem (sometimes, but not always, on an external drive) surviving unchanged through several migrations to newer Linux distributions and/or hardware. Compared to something like a PCIe or USB device, filesystems have a longer life.


That's more of a reason why it's hard to change the core MM/page cache/etc logic, as you have to change all the users of it (every filesystem, etc).

Ripping out a filesystem is, technically, near-trivial. The downsides are fully human: the Linux kernel-userspace ABI is normally considered a golden promise that shall not be broken. The developers don't want users to have a bad morning on which their old filesystems no longer work.


Buffer cache also. An article that goes into detail with an example of the coupling you're describing:

https://lwn.net/Articles/930173/


Ext2/ext4 is a little special that way; most aren't quite so intertwined.


The article mentions the proposed change would also affect these:

"including the ext4 filesystem, but also F2FS, FAT, GFS2, HFS, ISO9660 (CDROM), JFS, NTFS, NTFS3, and the device-mapper layer"


I don't know that it's technically difficult, but Linux hates breaking backward compatibility. If they remove a filesystem, some users won't be able to mount their filesystems any more. But Linux doesn't want to have unmaintained code either, so they only accept a filesystem if it's going to be maintained for the next 10-20 years.


The only "easy path" is if there exists a working fuse implementation of the same filesystem. Then there is a good fallback path for the users who still need support.


You can always run the kernel implementation as a fuse filesystem with guestmount.

Side note: I feel almost gaslit; it's really hard to find any mention of guestmount running a VM outside this single page https://libguestfs.org/guestfs-internals.1.html
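For reference, a guestmount invocation looks roughly like this (a sketch; the image path, partition name, and mount point are placeholders for your setup):

```shell
# guestmount boots a small libguestfs appliance VM with its own kernel
# and exposes the guest's filesystem on the host via FUSE, so the host
# kernel never has to parse the (possibly removed) on-disk format.
guestmount -a /path/to/disk.img -m /dev/sda1 --ro /mnt/legacy

# Files are now accessible like any other mount.
ls /mnt/legacy

# Tear down with the matching helper rather than plain umount.
guestunmount /mnt/legacy
```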


That's an interesting approach.

It's too bad linux can't already run any in-kernel filesystem as a user process via FUSE, when you prefer greater isolation in exchange for worse performance, at the flip of a mount option.

There's no technical reason for this to not be possible IMO... it's just a product of the tight coupling of everything in-kernel, as implemented today.

I'm short on time to confirm at the moment, but I believe it was this [0] talk that left me with the impression kernel devs were exploring general solutions of this nature.

[0] https://www.youtube.com/watch?v=xjv8Jv58bMs


Except in this case there will no longer be such a thing as "the kernel implementation". You'd have to run an older kernel, too. That's a ticking time bomb, more a migration/recovery strategy than something one could recommend longer term.


It's really not a "ticking time bomb". The older kernel is only running as a guest in a VM. It will eventually stop working (e.g. when Intel changes x86), but it's not really a security issue.


guestfs has some really useful tools that are, as you've noted, seemingly esoteric.


Well, until it's your boot FS. And there is a performance penalty too.


Hard to maintain because all this code has to handle common things (files, directories, etc.) in its own unique way. A big mapping mess.

Hard to remove because someone, somewhere uses it and Linux doesn't like to break things for the user.


It surprised me too. I'd have expected the infra code to be lacking when there's only 2 or 3 filesystems in the kernel, but after multiple filesystems had been added, I'd have expected the kinks to be ironed out and the common code to be generic enough to support almost any filesystem.


>People are using the snapshots to do backups of MySQL databases, he said, which is a test of the robustness of the feature

This is exciting because db workloads are btrfs's kryptonite. The only way to avoid crippling fragmentation on btrfs is to disable copy-on-write, which also disables checksumming hence nullifying one of btrfs's main selling points. ZFS seems to handle such loads much better, and it would be interesting to see how bcachefs deals with them.


Side note:

First mention of bcachefs here on HN (13 years ago), and now it's getting merged into the kernel.

What an awesome achievement!

https://news.ycombinator.com/item?id=1720077


That's bcache, not bcachefs.


We can't take it for granted yet that it will be merged


Some ~30 preparatory bcachefs patches have been posted for review for a month or so now; there was some discussion, but it was mostly negative.

Does anyone know how this actually works, is it going well for that patchset, is it getting closer to being merged?


This is normal. The positive things don't need to be talked about.

It is extremely unlikely that bcachefs will be merged as-is. This is true for anything of its size. (These days... There have been large subsystems in the past that were merged in an unacceptable state with the promise that they would eventually be fixed. They usually weren't. Which is why the bar is so high now.) But this just means that there will need to be debate and work to hammer it into a shape that is acceptable for the kernel. This can be a long process, but I don't think there is anything fundamentally wrong with bcachefs that would exclude it.


ok, thanks. At least (as the article notes) there seems to be progress on the code generation question, which was maybe the biggest objection ("proposed JIT allocator").


It might sound trivial, and I understand the history behind it, but I wonder if it is wise to put the word "cache" in the name of what is meant to be a durable, persistent storage component.


I thought about naming it kentfs, but prior precedent makes that a dubious idea :)


Yeah, this naming convention has not proven overly successful in Linux file systems, I would pass :-)


It's time for filesystems to escape the Latin alphabet. ΩFS anyone? ∞FS?



For anyone without an LWN sub (though it's a good idea!) the post is just digging into the details of how bcachefs is now in a position to be merged into the Linux kernel if Linus allows it, as well as some of the concerns about long term support given its complexity.

bcachefs was discussed recently on HN – https://news.ycombinator.com/item?id=35899527 – and is a file system with COW, a GPL compatible license (licensing is an area where ZFS can be tricky, to put it lightly), and ext4 levels of performance.


Just to note, the submission is a SubscriberLink, which gives courtesy access to anyone who has the link.


I was expecting that to be the case, but when I opened a non-logged in browser and clicked it (just to test), it sent me to the LWN login screen, so it may have been updated since or I encountered a bug. It now appears to be okay!


What I really would like: ext4 with snapshots and transparent compression where it may improve performance.


It's kludgey to graft snapshots and compression onto a classic in-place filesystem, so you'd probably end up with something slower and more complex than ZFS/btrfs/bcachefs. I remember tux2/tux3 was touting its simplicity but I don't know what happened.



Dave Chinner somehow managed to put reflinks on top of XFS, although it's not nearly the same as full filesystem snapshots. I believe there was discussion about implementing proper snapshots, but XFS is very conservatively developed, so that may be many years in the future, if it ever happens.


I think LVM or DM provide snapshot and compression capabilities which you could layer under ext4.

I like ext4 for being simple and fast and having options to turn off the unnecessary stuff. It's great for gigabytes of caches, tempfiles, build artifacts, scratch files etc. The bookkeeping and indirection needed for those features would be a waste of CPU cycles for shortlived data.
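As a sketch of that layering (the volume group and LV names here are placeholders), an LVM snapshot under ext4 might look like:

```shell
# Reserve space for changed blocks and create a CoW snapshot of an
# ext4 logical volume; /dev/vg0/data is a placeholder name.
lvcreate --snapshot --name data-snap --size 5G /dev/vg0/data

# Mount the snapshot read-only, e.g. to take a consistent backup.
mount -o ro /dev/vg0/data-snap /mnt/snap

# Drop the snapshot when done; its CoW overhead disappears with it.
umount /mnt/snap
lvremove -y /dev/vg0/data-snap
```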


I want ext4 to remain rock solid and reliable, and LVM to somehow gain btrfs's flexible block allocator so we can have easily expandable heterogeneous pools of drives. I think by this point it's clear btrfs is never gonna turn into Linux's ZFS so perhaps it'd be wise to implement its good ideas in other systems instead.


so essentially btrfs/bcachefs without the focus on being used for RAID?


btrfs doesn't allow nodatacow with snapshots.

I know these two features conflict somewhat, but it means we can't run database workloads with decent performance on btrfs: btrfs snapshots just can't scale, and the CoW has too high a performance hit.

ZFS handles snapshots with database workload just fine.


ZFS is cow, so if it works for your load, cow doesn't seem to be the problem?

If you're willing to switch OSes, FreeBSD UFS is a more traditional filesystem, with optional snapshots (modifications to files in a snapshot have to be cow, of course)


> ZFS is cow, so if it works for your load, cow doesn't seem to be the problem?

Btrfs' implementation of cow, of course.

They have no intention of making it work for database workloads. When you ask why it's slow, they just tell you to disable CoW (which requires recreating the file, something you'd desperately want to avoid with a multi-TB database).


Have you got any recent snapshot benchmark available? I can find only one ancient one.

> btrfs doesn't allow nodatacow with snapshots

What's the use case for a snapshot at that point? Isn't it the same as making a copy of the files?

I'm not even sure what this would look like... CoW is what enables snapshots. Without CoW you can't get a consistent copy anymore.


I guess it would look like LVM snapshots? CoW only when a snapshot is present.


That's how btrfs chattr +C files behave already.
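For reference, marking a file nodatacow looks like this on a btrfs mount (the paths are placeholders; the attribute only takes effect on empty files):

```shell
# The C attribute must be set while the file is still empty; btrfs
# cannot retroactively un-CoW data that was already written.
touch /mnt/btrfs/pgdata/table
chattr +C /mnt/btrfs/pgdata/table
lsattr /mnt/btrfs/pgdata/table    # the flags column now includes 'C'

# Setting +C on a directory makes newly created files inherit it.
chattr +C /mnt/btrfs/pgdata
```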


> What's the use case for a snapshot at that point? Isn't it the same as making a copy of the files?

As you say, it's a consistent copy. cp(1) won't give you that.


Neither does a filesystem snapshot unless the snapshot is aware of every application's consistency semantics. You can't simply assume that every flush() or other write barrier means that the application data is now in a consistent state on-disk.


These days, with production systems being componentized across VMs, abstractions like LVM in use, etc., the filesystem is no longer a fixed "feature of the deploy environment" chosen separately from the needs of the application, that needs to cope with any random thing the production application might do; but rather, a prod deployment is designed as a whole, where you choose the filesystem and plan the use of its features relative to what application you'll be deploying on it. (And, in fact, for a stateful application like a DBMS, you'll probably have one or more separate volumes just for the application state — so the filesystem you use to solve application-state problems doesn't even need to be good at being a rootfs for an OS; it only needs to be good for your application use-case.)

Under this paradigm, rather than trying to make a filesystem that understands applications well enough to snapshot them, you instead make applications that have filesystem snapshots as part of their conceptual model.

In the lower-effort version of this approach, you have software like Postgres, where the application layer can be told "I'm going to use filesystem tooling to take a consistent snapshot, so make your on-disk state consistent for a while." In PG, you'd call pg_start_backup(), which will flush all pending writes to the table files to disk, and then spool all future writes purely in the WAL journal until the backup completes (i.e. until you tell it pg_stop_backup()) — at which point all those pending changes get replayed out to the table files.

In the higher-effort version of this approach, you have software that has one or more CoW filesystems it natively understands and integrates with — where you never directly address the filesystem at all, but rather, you tell the application to take an application level snapshot; and then the application uses the filesystem CoW features, together with its own consistency primitives, to efficiently achieve that (which might not necessarily result in something that's a "filesystem snapshot" from the FS's perspective, but rather just a bunch of individual CoW-cloned files in a directory.) I believe that Oracle DBMS does this, though I might be wrong.
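The lower-effort Postgres flow described above can be sketched like this (paths and names are placeholders, and LVM is shown only as one possible snapshot mechanism; note that PostgreSQL 15 renamed these functions to pg_backup_start/pg_backup_stop):

```shell
# Ask Postgres to quiesce its table files into a snapshot-safe state
# (pg_start_backup is the pre-15 name; 'true' requests a fast start).
psql -c "SELECT pg_start_backup('fs-snapshot', true);"

# Take the filesystem-level snapshot while that state is held.
lvcreate --snapshot --name pgdata-snap --size 10G /dev/vg0/pgdata

# Let Postgres replay the spooled WAL and resume normal operation.
psql -c "SELECT pg_stop_backup();"
```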


Not "every application". A single database engine.


So I guess you'd have to temporarily turn on CoW, take a snapshot, make sure all the files are fully duplicated, turn off CoW to achieve that. Since you can't flip the setting at runtime right now, that feature seems quite far away.


In btrfs, you can't.

You need to copy the file over to enable/disable CoW.


> btrfs doesn't allow nodatacow with snapshots.

Uhh. "You can't do overwrite-in-place if you want to keep a snapshot copy of the old data" simply makes sense, and you'll find every overwrite-in-place design that supports snapshots will take a write performance hit around the time the snapshot; either the snapshot has to atomically copy the whole data, or the first write after a snapshot can't be overwrite-in-place (or it has to make a separate copy of the original data for the snapshot; similar but worse).


Proper CoW means that every change to a block has to cause copies, not just the first write after a snapshot. That makes a big difference.

And "move the snapshotted data out of the way upon writing" is going to give you better performance in a lot of cases.


I was replying to this part:

> btrfs doesn't allow nodatacow with snapshots.

You can absolutely mix chattr +C with snapshots. It's CoW-when-needed in the face of snapshots, just like everything has to be in order to support snapshots.


So they were just wrong in saying it's not supported? You didn't make that clear.

And yes I can find many sources saying it's supported.


Check VDO device mapper module.


How long til it would be recommended to move from ZFS to BcacheFS on Linux?


10+ years, considering how Btrfs has played out.

Btrfs was merged in 2009, but didn't gain wide acceptance until quite recently. Among the biggest distros, only Fedora uses it as default, and RHEL has actually dropped support. Even today, you can still find people claiming that they lost data because of it, though whether they are telling the truth I cannot say.


btrfs was merged in Linux in a very different state than bcachefs is now; bcachefs already has about ten years of development behind it, whereas btrfs "only" had two years of development behind it when it was merged. It would almost certainly not be merged today.

I'm not saying you should switch all your critical systems to bcachefs on the day it gets merged as you can never be sure about the absence of bugs (even the relatively simple ext4 had some data-eating bug a few years after introduction), but the path to "recommended filesystem" will be a lot shorter than btrfs.

At this point, I would already feel comfortable running bcachefs on my laptop – the only reason I don't is that I just can't be bothered running a custom kernel for it.


How does bcachefs deal with laptops just dying due to lack of power and then fsck and recover on boot?


Works fine. Been using it on my laptop for ~7 years, users say it's solid w.r.t. power failure too.


Awesome, can't wait for this to become linux's best filesystem.


Apparently, just fine. According to the paper, it does a quick check and clean on mount. There are mount options for a full check, and for degraded and recovery modes. The paper: https://bcachefs.org/bcachefs-principles-of-operation.pdf


I believe openSUSE uses Btrfs.

I haven't lost data on it, but my 8 drive Btrfs RAID6 filesystem locked up read-only on me, which wasn't fun. Switched to ZFS after that.


I have had single-drive btrfs filesystems lock up on me twice; i.e. file manipulation commands like `ls` freeze and "take forever". In both cases, there was a hardware failure of the drive. I believe that the frozen processes went into uninterruptible sleep (so they couldn't be killed).

It is totally understandable that if the underlying driver starts returning errors, btrfs (or any other filesystem) may be unable to provide access to my data; but I was not at all happy about having to reboot the entire machine.

(I must admit, it is possible that the "freezing" was happening in underlying block device driver code and not in btrfs. I don't remember if I ever checked wchan to see which it was. My impression from reading dmesg output was the issue seemed to be with btrfs.)

I did once have a similar experience with ZFS as well. Sigh.


Couple weeks ago had someone complain about system updates and other mildly IO heavy things making his system extremely slow and lock up for minutes at a time, despite having an NVMe SSD. Rebalancing fixed it - which iirc I also had to do regularly about ten years ago when I had the exact same problems with btrfs.


Yeah, OpenSUSE does BTRFS although they also support XFS. I ran tumbleweed for a while and managed to lose a root filesystem twice, although that was a few years ago and appeared to be triggered by running out of space, as the home filesystem on the same box, also BTRFS, survived both times.


RAID5/6 is still quite beta in btrfs. mdadm+btrfs is the way to go here.


Actually any RAID is beta in btrfs, but configurations with one storage device (e.g. firmware of Turris Omnia) do quite well on prod.


No, it's not beta. You can check the status here https://btrfs.readthedocs.io/en/latest/Status.html

Only raid5/6 has known issues.


This was also about 8 years ago. It was originally a RAID1 setup, but I did an in-place conversion to RAID6, which is cool. And like I said, I didn't lose any data.

With mdadm, I don't get auto-repair on top of the checksumming.


I got a Btrfs RAID1 corrupted 3 times, so I was losing faith in Btrfs; then I ran memtest and found out I had a bad memory stick causing real data corruption, not a Btrfs bug.


It is quite understandable that something as critical when it comes to file storage takes a while to be adopted and being tested on less important data first. I'd say this reflects positively on IT.

A nice (and lone) data point regarding btrfs, though: My dentist told me that he was storing patient data using btrfs last year. :)


I can't help but wonder how that conversation happened: Ya so we're going to root canal your molar, ah and by the way patient data is on btrfs which is only slightly worse...


Oh, you never lost data with btrfs, but you might have given up following the trail of mailing list posts and wikis one fine spring morning when it failed to mount.

ISTR hours-long repair operations and a level of desperation that a lot of people wouldn't have had time for. This is assuming you didn't use its RAID or other features that definitely would eat your data.

(This is experience from 10 years ago when I was looking for the latest and greatest features to support a hosting platform)


Btrfs is significantly more complex and a very different case. Bcachefs is designed by a single developer with a different motivation.

The main problem with bcachefs is that it's in an unknown state: no major testing has been done on it. When some company (like Oracle or SUSE) puts it through an automated stress-test matrix, it may do exceptionally well or completely fail.


The article discusses having automated testing set up in conjunction with Red Hat's increasing interest in it.


I am not familiar with Red Hat, but other companies have proprietary verification tests. It takes several decades of machine time to complete.

For example, SUSE has participated for over 25 years in Ext3/4, ReiserFS/4, XFS, BTRFS... This know-how is not public!


How do you know of Oracle and SUSE but not Red Hat?


Probably meant not familiar with Red Hat’s automated testing?


openSUSE also uses it as the default, as do Synology NAS appliances.


Synology uses mdadm+btrfs for the record.

Still a pretty good endorsement in my mind.


I keep wondering this too, but at the same time, albeit with bugs, I have a ZFS pool mountable under Linux and Windows, so I have decided not to hold my breath at all. Not to mention zrepl, etc.


I recently discovered that btrfs can also be mounted under windows.



