Linux NILFS file system: automatic continuous snapshots (dataswamp.org)
242 points by solene on Oct 11, 2022 | 148 comments



I think NILFS is a hidden gem. I’ve been using it exclusively on my Linux laptops, desktops, etc. since ca. 2014. Apart from one kernel regression bug related to NILFS2 it has worked flawlessly (no data corruption even with the bug, just no access to the file system; effectively it forced running an older kernel until the bug was fixed).

The continuous snapshotting has saved me a couple of times; I’ve just mounted a version of the file system from a few hours or weeks ago to access overwritten or deleted data. I also use NILFS on backup disks to provide combined deduplication and snapshots easily (just rsync plus NILFS’ mkss, the latter to make sure the “checkpoints” aren’t silently garbage collected in case the backup disk gets full).
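For reference, a rough sketch of that recovery workflow using the stock nilfs-utils tools (the device, mount point and checkpoint number are placeholders):

  # list recent checkpoints and pick one from before the mistake
  lscp /dev/sdb1

  # promote that checkpoint to a snapshot so the cleaner won't reclaim it
  chcp ss /dev/sdb1 1234

  # mount the snapshot read-only next to the live filesystem
  mkdir -p /mnt/nilfs-snap
  mount -t nilfs2 -r -o cp=1234 /dev/sdb1 /mnt/nilfs-snap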


>I think NILFS is a hidden gem. I’ve been using it exclusively in my Linux laptops, desktops etc. since ca. 2014

Yes, it's really sad: here we have a native and stable checksumming fs, and nearly no one knows about it.


> check-summing fs

Is it? Last I'd heard was

> nilfs2 store checksums for all data. However, at least the current implementation does not verify it when reading.

https://www.spinics.net/lists/linux-nilfs/msg01063.html


Hmm, you could be right; I found nothing saying that it is verified at read time, just by fsck.


BTRFS is also a native copy on write filesystem that verifies a configurable checksum and supports snapshots.

The snapshots are not automatic, but short of that it is pretty feature complete


BTRFS is pretty stable nowadays.


What does that mean quantifiably?


I’ve used it for about 10 years without incident, except for problems I had once with enabling experimental options in the kernel module (custom build). I have used BTRFS exclusively and extensively, in terms of what you can do with it, since discovering it all those years ago. The only thing I haven’t used is its native RAID support. I think 0 and 1 are fine but I’m not sure about 5/6; parity was still experimental when I last checked in 2016, maybe it’s reliable now. Everything else, though (compression, snapshots, copy on write, online defrag), has all been fine. It is the default for openSUSE Leap and Tumbleweed, which use it for snapper (OS snapshots), and BTRFS subvolumes are also supported by Docker for container images and containers. It saves a lot of space when images start to add up.


Synology deploys it in their products


I think it's the default for suse these days.


It is, and openSUSE's default of "create a snapshot before and after every 'zypper in' and/or 'zypper dup'" has saved my bacon on more than one occasion.
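For anyone curious, the rollback workflow looks roughly like this (the snapshot numbers and file path are illustrative):

  # list the pre/post snapshots created around zypper transactions
  snapper list

  # compare and selectively revert changes between two snapshots
  snapper undochange 42..43 /etc/zypp/zypp.conf

  # or roll the whole root filesystem back to snapshot 42 (takes effect on reboot)
  snapper rollback 42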


SLES explicitly says to just use it for the system, NOT data... there you use XFS.


That's why i specifically wrote -> stable...


BTRFS is not stable?


If it’s not, perhaps it should be; I’ve been using it for about 10 years without any problems that I didn’t cause for myself.


The name isn’t great. To someone unfamiliar with it, it sounds like a synonym for ‘Null FS’ and elicits thoughts of a mock filesystem for testing. In a list of filesystems I would gloss straight over it.


> Apart from one kernel regression bug related to NILFS2 it’s worked flawlessly

Maybe on x86? I’ve tried repeatedly to use it on ARM for the Raspberry Pi, where it would have been perfect, but always ran into various kernel panics as soon as the file system was mounted or accessed.


I've used NILFS2 on flash storage on some old non-RPi ARMv7 hardware for a while without a problem. Switched to F2FS for performance reasons, though.


This was on the root partition, which is subjected to a lot more concurrency issues than an SD card normally would be, FWIW.


True, I only have used it on x86 devices. Thanks for the heads up!

I’ve heard so many stories of SD card failures (against which snapshotting might be of no help) with the Raspberry Pi that I’ve decided to send any valuable data promptly to safety over a network. (Though, I personally haven’t had any problems with failing SDs.)


NILFS is absolutely wonderful; it was very unfortunate that Linus chose to dub btrfs as the ext4 successor all those years ago, because it cut off a lot of interest in the plethora of interesting work that was going on at the time.

A decade later and btrfs is still riddled with problems and incomplete, people are still using xfs and ext4 for lack of trust, one kernel dev has a side hobby trying to block openzfs, and excellent little projects like nilfs are largely unknown.


> one kernel dev has a side hobby trying to block openzfs

Can you elaborate?


I don't know if everything has been collected in an easy-to-digest form, but GKH has gone out of his way a few times to shut out OpenZFS.


Exactly. Removing or changing the licensing on APIs in a way carefully targeted to break ZFS, but not affect actual proprietary drivers such as the NVidia graphics drivers.


> it was very unfortunate that Linus chose to dub btrfs as the ext4 successor

That quote is from Ted Ts'o (https://en.wikipedia.org/wiki/Btrfs#History), or do you have a link to Linus' quote?


I had issues with file locking when running some legacy database software on NILFS2. Probably caused data corruption in that database (not the FS itself).

The SourceForge website for NILFS2 suggests that there are some unimplemented features, one of them being synchronous I/O, which might have caused that issue?

https://nilfs.sourceforge.io/en/current_status.html

In some cases, NILFS2 is safer storage for your data than ZFS. So NILFS might work for some simple use cases (e.g. locally storing documents that you modify often), but it's certainly not ready to be deployed as a generic filesystem. It's relatively slow and sometimes behaves a bit weird. If something goes really bad, the recovery might be a bit painful. There is no fsck yet, nor much community support. NILFS2 can self-heal to some extent.

I really like the idea of NILFS2, but at this point I would prefer a patch adding continuous snapshotting to ZFS. Unlike NILFS2, ZFS has lots of active developers and a big community, while NILFS2 is almost dead. The fact that it's been in the kernel for quite some time and most people haven't even noticed it (despite its very interesting features) speaks for itself.

Don't get me wrong. I wish that more developers would get interested in NILFS2, fix these issues and bring it on par with ext4, XFS and ZFS... But ZFS still has more features overall, so we might just add continuous snapshots in memoriam of NILFS2.


> In some cases, the NILFS2 is safer storage for your data than ZFS.

What cases? Do you just mean due to continuous snapshots protecting against accidental deletes or such, or are there more "under the covers" things it fixes?


If the problem is anything like this:

https://www.reddit.com/r/btrfs/comments/fcntc6/qcow2_on_btrf...

You can either make a subvolume and mount it with nocow or set the +C file attribute to disable CoW:

http://blog.jim.nz/2015/12/04/btrfs-subvolume-with-nocow.htm...

Doesn’t really look like nilfs2 has this kind of flexibility, though
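For the btrfs route, a rough sketch (the paths are placeholders; note that +C only takes effect on files created after the flag is set):

  # keep VM images in their own subvolume and mark it No_COW
  btrfs subvolume create /data/vmimages
  chattr +C /data/vmimages

  # verify; new files created inside inherit the 'C' (no copy-on-write) flag
  lsattr -d /data/vmimages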


Yeah, I mean just due to continuous snapshots. Otherwise I don't really trust it that much. It's like this outsider kid who is kinda cool, but still needs to grow up and sell it :-D


It’s basically append-only for recent things, so theoretically you can’t lose anything (within a reasonable timeframe). I don’t know if the porcelain exposes everything you need to avail yourself of that design functionality, though.


Does NILFS do checksums and snapshotting for every single file in the system? One of my biggest complaints about file systems in general is that they are all designed to treat every file the exact same way.

We now have storage systems (even SSDs) that are big enough to hold hundreds of millions of files. Those files can be a mix of small files, big files, temp files, personal files, and public files. Yet every file system must treat your precious thesis paper the same way it treats a huge cat video you downloaded off the Internet.

We need some kind of 'object store' where each object can be given a set of attributes that govern how the file system treats it. Backup, encryption, COW, checksums, and other operations should not be wasted on a bunch of data that no one really cares about.

I have been working on a kind of object file system that addresses this problem.


It might sound weird but the hard part of what you describe is not the technology but how to design the UX in a way that you aren’t babysitting everything.

And doing that is not at all easy. For all anybody knows your cat video is “worth more” to you than your thesis paper. How can you get the system to determine the worth of each file without manually setting an attribute each time you create a file? And if you let the system guess, the cost of failure could be very high! What if it decided your thesis paper was worthless and stored it with a lower “integrity” (or whatever you call the metric)?

I dunno. Storage is getting cheaper all the time and it might just be easier to say fuck it and treat all files with the same high level of integrity. Maybe it would be so much work for a user to manually manage that they’d just mark everything the same?


You could always set the default behavior to be uniform for all files (e.g. protect everything or protect nothing) and just forget about it. But it would be nice to be able to manually set the protection level for specific files that are the exception.

If I was copying an important file into an unprotected environment, I could change how it was handled (likewise if I was downloading some huge video I didn't care about into a system where the default protection was set to high).

I agree that if you have 100 million files, then it could be nearly impossible to classify every single one of them correctly.


On a directory basis, or even better, a numerical priority that could be set manually in the application that generated the files, or automatically based on the user or application, or in a hypervisor based on the VM. Then it could be an opportunistic setting.

I thought ZFS had some sort of unique settings like this.


I’d think on a directory basis would be the ideal


Why is it ‘wasted’? Those things are mostly free on modern hardware.

The challenge with your thesis here is that the only one who can know what is ‘that important’ is YOU, and your decision making and communication bandwidth is already the limiting factor.

For many users, that cat video would be heartbreaking to lose, and they don’t have term papers to worry about.

So having to decide or think what is or is not ‘important enough’ to you, and communicate that to the system, just makes everything slower than putting everything on a system good enough to protect the most sensitive and high value data you have.


> For many users, that cat video would be heartbreaking to lose, and they don’t have term papers to worry about.

Depends on where that cat video is / how it ended up on the disk.

The user explicitly saved it to their user-profile Downloads directory? Yeah, sure, the user might care a lot about preserving that data. There's intent there.

The user's web browser implicitly saved it into the browser's cache directory? No, the user absolutely doesn't care. That directory is a pure transparent optimization over just loading the resource from the URL again; and the browser makes no guarantees of anything in it surviving for even a few minutes. The user doesn't even know they have the data; only the browser does. As such, the browser should be able to tell the filesystem that this data is discardable cache data, and the filesystem should be able to apply different storage policies based on that.

This is already true of managed cache/spool/tmp directories vis-a-vis higher-level components of the OS. macOS, for example, knows that stuff that's under ~/Library/Caches can be purged when disk space is tight, so it counts it as "reclaimable space"; and in some cases (caches that use CoreData) the OS can even garbage-collect them itself.

So, why not also avoid making these files a part of backups? Why not avoid checksumming them? Etc.


Backups - possibly, but no one I know counts COW/Snapshots, etc. as backups. Backup software generally already avoids copying those.

They can be ways to restore to a point in time deterministically - but then they are absolutely needed to do so! Otherwise, the software is going to be acting differently with a bunch of data gone from underneath it, no?

Checksumming is more about being able to detect errors (and deterministically know if data corruption is occurring). So yes, absolutely temporary and cache files should be checksummed. If that data is corrupted, it will cause crashes of the software using them and downstream corruption after all.

Why would I not want that to get caught before my software crashes or my output document (for instance) is being silently corrupted because one of the temporary files used when editing it got corrupted to/from disk?


> So yes, absolutely temporary and cache files should be checksummed. If that data is corrupted, it will cause crashes of the software using them and downstream corruption after all.

...no? I don't care if a video in my browser's cache ends up with a few corrupt blocks when I play it again a year later. Video codecs are designed to be tolerant of that. You'll get a glitchy section in a few frames, and then hit the next keyframe and everything will clean up.

In fact, most encodings — of images, audio, even text — are designed to be self-synchronizing in the face of corruption.

I think you're thinking specifically of working-state files, which usually need to be perfect and guaranteed-trusted, because they're in normalized low-redundancy forms and are also used to derive other data from.

But when I say "caching", I'm talking about cached final-form assets intended for direct human consumption. These get corrupted all the time, from network errors during download, disk storage errors on NASes, etc; and people mostly just don't care. For video, they just watch past it. For a web page, they hard-refresh it and everything's fine the second time around.

If you think it's impossible to differentiate these two cases: well, that's because we don't explicitly ask developers to differentiate them. There could be separate ~/Library/ViewCache and ~/Library/StateCache directories.

And before you ask, a good example of a large "ViewCache" asset that's not browser-related: a video-editor render-preview video file (the low-quality / thumbnail-sized kind, used for scrubbing.)


If they are corrupted on disk the behavior is not so deterministic as a ‘broken image’ and a reload. Corrupted on disk content causes software crashes, hangs, and other broken behavior users definitely don’t like. Especially when it’s the filesystem metadata which gets corrupted.

Because merely trying to read it can cause severe issues at the filesystem level.

I take it you haven’t dealt with failing storage much before?


I maintain database and object-storage clusters for a living. Dealing with failing storage is half my job.

> Especially when it’s the filesystem metadata which gets corrupted.

We're not talking about filesystem metadata, though. Filesystem metadata is all "of a piece" — if you have a checksumming filesystem, then you can't not checksum some of the filesystem metadata, because all the metadata lives in (the moral equivalent of) a single database file the filesystem maintains, and that database gets checksummed. It's all one data structure, where the checksumming is a thing you do to that data structure, not to individual nodes within it. (For a tree filesystem like btrfs, this would be the non-cryptographic equivalent of a merkle-tree hash.) The only way you could even potentially turn off filesystem features for some metadata (dirent, freelist, etc) nodes but not others, would be to split your filesystem into multiple filesystems.

No, to be clear, we're specifically talking about what happens inside the filesystem's extents. Those can experience corruption without that causing any undue issues, besides "the data you get from fread(3) is wrong." Unlike filesystem metadata, which is all required for the filesystem's integrity, a checksumming filesystem can choose whether to "look" inside file extents, or to treat them as opaque. And it can (in theory) make that choice per file, if it likes. From the FS's perspective, an extent is just a range of reserved disk blocks.

Now, an assumption: only storage arrays use spinning rust for anything any more. The only disk problems consumer devices face any more are SSD degradation problems, not HDD degradation problems.

(Even if you don't agree with this assumption by itself, it's much more clear-cut if you consider only devices operated by people willing to choose to use a filesystem that's not the default one for their OS.)

This assumption neatly cleaves the problem-space in two:

- How should a filesystem on a RAID array, set up for a business or prosumer use-case, deal with HDD faults?

- How should a single-device filesystem used in a consumer use-case deal with SSD faults?

The HDD-faults case comes down to: filesystem-level storage pool management with filesystem-driven redundant reads, with kernel blocking-read timeouts to avoid hangs, with async bad-sector remapping for timed out reads. Y'know: ZFS.

While the SSD-faults case comes down to: read the bad data. Deal with the bad data. You won't get any hangs, until the day the whole thing just stops working. The worst you'll get is bit-rot. And even then, it's rare, because NAND controllers use internal space for error-correction, entirely invisibly to the kernel. (See also: http://dtrace.org/blogs/ahl/2016/06/19/apfs-part5/)

In fact, in my own personal experience, the most likely cause of incorrect or corrupt data ending up on an SSD/NVMe disk, is that the CPU or memory of the system is bad, and so one or the other is corrupting the memory that will be written to disk before or during the write. (I've personally had this happen at least twice. What to look for to diagnose this: PCIe "link training" errors.)


And yet, I've had several block level bit flips on SSDs and NVMe over the years in high heat/high load environments that weren't that.

And one caused a kernel panic because the fs metadata now pointed towards impossible values.

So tell me again how none of this matters?


Nothing is free or even 'mostly free' when managing data. Data security (encryption), redundancy (backups), and integrity (checksums, etc.) all impose a cost on the system.

Getting each piece of data properly classified will always be a challenge (AI or other tools may help with that), but it would still be nice to be able to do it. If I have a 50GB video file that I could easily re-download off the Internet, it would be nice to be able to turn off any security, redundancy, or integrity features for it.

I wonder how many petabytes of storage space is being wasted by having multiple backups of all the operating system files that could be easily downloaded from multiple websites. Do I really need to encrypt that GB file that 10 million people also have a copy of? Am I worried if a single pixel in that high resolution photo has changed due to bit rot?


Worried if it gets lost or mangled? Not necessarily. Worried if it’s happening and I have no idea, and it’s spreading, due to hardware or software problems? And I’ll only discover it when it hits something I actually really care about and can’t easily replace? Absolutely!

The big issue regarding duplication is really more an identification/‘supply chain’ issue. The last thing I want to be doing is trying to figure out how to get that other file from somewhere (that works), from whoever ‘has another copy’, when it gets mangled and I need a replacement ASAP.

If you’re thinking of our local filesystems as potentially just a cache, we have no easy or secure way right now to fingerprint or recreate the other entries in the cache from sources (minus browser caches or the like, but even then, re-retrieving it may return different or no content).

So backups are really keeping the computing equivalent of local ‘ability to manufacture’ handy as a mitigation against risk. It’s not waste, any more than keeping the ability to manufacture its own tanks and weapons in-country is a waste for a nation. It’s an insurance cost against real-world problems, and causes a moral hazard with other actors if not done.

And CRC32/CRC32C (even in Java!) is > 16GB/s per core in modern processors, and more than adequate for block sized (typ. < 32MB) checksums. Blake3 is > 2GB/s per core, and is more than adequate for… well every use case we’re currently aware of.

Modern OS’s really don’t have any excuses for not doing it.


>Do I really need to encrypt that GB file that 10 million people also have a copy of?

Indeed you don't. Poettering has a similar idea in [1] (scroll down to "Summary of Resources and their Protections" for the tl;dr table), where he imagines OS files are only protected by dm-verity (for Silverblue-style immutable distros) / dm-integrity (for regular mutable distros).

[1]: https://0pointer.net/blog/authenticated-boot-and-disk-encryp...


You can turn off CoW, checksumming, compression, etc at the file and directory levels using btrfs.


Indeed. You can also make a directory into a subvolume so that that directory is not included in snapshots of the parent volume.
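A minimal sketch of that pattern, assuming /home/user is itself a subvolume (paths are placeholders):

  # a nested subvolume acts as a snapshot boundary:
  # snapshots of the parent will not descend into it
  btrfs subvolume create /home/user/Downloads

  # this read-only snapshot will contain Downloads only as an empty directory
  btrfs subvolume snapshot -r /home/user /home/.snapshots/user-$(date +%F)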


I have been spinning my wheels on personal backups and file organization the last few months. It is tough to perfectly structure it.

I think directories or volumes having different properties and you having it split up as /consumer-media /work-media /work /docs /credentials etc may be the way to go.

Then you can set integrity, encryption etc separately, either at filesystem level or as part of the software-level backup strategy.


> Does NILFS do checksums and snapshotting for every single file in the system?

NILFS is, by default, a filesystem that only ever appends until you garbage collect the tail. It doesn't really "snapshot" in the way that ZFS or btrfs do, because you can just walk the entire history of the filesystem until you run out of history. The snapshots are just bookmarks of a consistent state.


Well, you can kind of do that with ZFS filesystems, and the "object" is the recordsize.
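For example, ZFS lets you tune integrity and redundancy properties per dataset rather than per file; a rough sketch with made-up dataset names:

  # scratch data: single copy, relaxed sync semantics
  zfs create -o copies=1 -o sync=disabled tank/scratch

  # important documents: extra on-disk copies and a stronger checksum
  zfs create -o copies=2 -o checksum=sha256 tank/docs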


I was going to ask: "Is there any limit on the number of ZFS filesystems in a pool?" Google says 2^64 is the limit.

Couldn't one just generate a filesystem per object if snapshots, etc., on a per-object level is what one cared about? Wonder how quickly this would fall over?

> Backup, encryption, COW, checksums, and other operations should not be wasted on a bunch of data that no one really cares about.

This GP comment is a little goofy though. There was a user I once encountered who wanted ZFS, but a la carte. "I want the snapshots but I don't need COW." You have to explain, "You don't get the snapshots unless you have the COW", etc.


On Btrfs, you can mark a folder/file/subvolume to have nocow, which has the effect of only doing a COW operation when you are creating snapshots.


And that may work for btrfs, but again at some cost:

"When you enable nocow on your files, Btrfs cannot compute checksums, meaning the integrity against bitrot and other corruptions cannot be guaranteed (i.e. in nocow mode, Btrfs drops to similar data consistency guarantees as other popular filesystems, like ext4, XFS, ...). In RAID modes, Btrfs cannot determine which mirror has the good copy if there is corruption on one of them."[0]

[0]: https://wiki.tnonline.net/w/Blog/SQLite_Performance_on_Btrfs...


Yup. It’s a pretty fundamental thing. COW and data checksums (and usually automatic/inline compression) co-exist that way because it’s otherwise too expensive performance wise, and potentially dangerous corruption wise.

For instance, if you modify a single byte in a large file, you need to update the data on disk as well as the checksum in the block header, and other related data. Chances are, these are in different sectors, and also require re-reading in all the other data in the block to compute the checksum. Anywhere in that process is a chance for corruption of the original data and the update.

If the byte changes the final compressed size, it may not fit in the current block at all, causing an expensive (or impossible) re-allocation.

You could end up with the original data and update both invalid.

Writing out a new COW block is done all at once, and if it fails, the write failed atomically, with the original data still intact.


> Chances are, these are in different sectors, and also require re-reading in all the other data in the block to compute the checksum. Anywhere in that process is a chance for corruption of the original data and the update.

Not much different than any interrupted write though. And a COW needs to reread just as much.

> If the byte changes the final compressed size, it may not fit in the current block at all, causing an expensive (or impossible) re-allocation.

Something that you must always pay for in a COW filesystem anyway? And it's handled by other non-COW filesystems anyway.

Just because a filesystem isn't COW doesn't mean every change needs to be in place either. Of course, a filesystem that is primarily COW might not want to maintain compression for non-COW edge-cases and that is quite reasonable.


Literally none of what you are saying is true.

A COW write only needs to checksum the newly written bytes. A non-COW filesystem needs to checksum all data contained in the block (the unchanged prior data together with the new values).

Additionally, a non-COW filesystem needs to update all metadata checksums/values for existing blocks. It's a much more pathological case (interrupted write wise) than a COW filesystem, because if it writes data, but hasn't written the checksum yet - the block is now corrupt. If it writes the checksum, but not the data, the block is now corrupt. And there is no way to know which one is correct post-facto, without storing the old data and the new data somewhere. Which has the exact same overhead or worse as COW. And the data in a FS block is usually many large multiples of the sector size, which makes writes pretty hard to do in any sort of atomic way. Journaling helps, but not with performance here! Since you'd need to store the prior values + the new values, or you're still guaranteed to lose data.

Compression wise, this isn't (as much) of an issue for COW filesystems, because it only needs to compress the newly written data, which can be allocated without concern to the previous allocation size, which is still there, allocated. It can mean less efficient compression if these are small, fragmented writes of course, which is why most of them have some sort of batching mechanism in place. Alternatively, it can copy a chunk of the block, though that can cause write amplification, and is usually minimized.

But you don't run across potential pathological fragmentation issues, like where you compress prior blocks which now take significantly less space, or new blocks take more space and require reshuffling everywhere.


>A COW write only needs to checksum the newly written bytes. A non-cow filesystem needs to checksum all data contained in the block (unchanged prior with now new values).

No, a COW will re-read the entire block. Make the change and update the checksum and then write it back (to a new location, obviously). Way more than the newly written bytes - but way less than the entire file of course. Just as a non-COW fs will.

>If it writes the checksum, but not the data, the block is now corrupt. And there is no way to know which one is correct post-facto, without storing the old data and the new data somewhere.

Which is exactly what the journal is for - and you already have a journal. Or you don't update it in-place. Just because it isn't COW doesn't mean you always have to update in place.

>Compression wise, this isn't (as much) of an issue for COW filesystems, because it only needs to compress the newly written data, which can be allocated without concern to the previous allocation size, which is still there, allocated. It can mean less efficient compression if these are small, fragmented writes of course, which is why most of them have some sort of batching mechanism in place. Alternatively, it can copy a chunk of the block, though that can cause write amplification, and is usually minimized.

You can do exactly the same for a non-COW filesystem.

>But you don't run across potential pathological fragmentation issues, like where you compress prior blocks which now take significantly less space, or new blocks take more space and require reshuffling everywhere.

COW has the same issue. COW always need to "reshuffle"(?) data somewhere.


I think you don't understand COW or non-COW filesystems?

They don't work the way you are asserting.


I could say the same. Be more specific please. If everything I've said is wrong it should be easy to point out something demonstrably false.


For one, you get no performance benefit over non-cow unless you update in place. It’s what every ‘fast and easy’ filesystem has to do - fat (including exfat), ext3, ext4, etc.

The failure modes are well documented - and got worse in many cases trying to work around performance issues due to journaling, but the journaling doesn’t resolve the issue fully because they can’t store all the data they need without making the performance issues worse. See https://en.m.wikipedia.org/wiki/Ext4 and ‘Delayed allocation and data loss’ for one example.

This isn’t a solved (or likely solvable in a reasonable way) problem with non-COW filesystems, which is one of the reasons why all newer filesystems are COW. The other being latency hits from tracking down COW delta blocks aren’t a big issue now due to SSDs and having enough RAM to have decent caches and pre-allocation buffers.

Also, COW doesn’t need to allocate (or re-read/re-checksum) the entire prior block when someone changes something, unlike modify in place. Due to alignment issues, doing SOME usually makes sense, but it’s highly configurable.

It only needs to add new metadata with updated mapping information for the updated range in the file, and then checksum/write out the newly updated data (only), plus or minus alignment issues or whatever. It acts like a patch. That’s literally the whole point of COW filesystems.

Update in place has an already allocated block it has to deal with in real time, either now consuming less space in its already allocated area (leaving tiny fragmentation) or by having to allocate a new block and toss the old one, which will have worse real-time performance than a COW system, as it’s doing the new block allocation (which is more space than a COW write, unless the COW write is for the entire block's contents!), plus going back and removing the old block.

ZFS record size for instance is just the maximum size of one of the patch ‘blocks’. The actual records are only the size of the actual write data + Filesystem overhead.

ZFS only then goes back and removes old records when they aren’t referenced by anyone, which is typically async/background, and doesn’t need to happen as part of the write itself.

This allows freeing up entire regions of pool space easier, and fragmentation becomes much less of an issue.


>For one, you get no performance benefit over non-cow unless you update in place. It’s what every ‘fast and easy’ filesystem has to do - fat (including exfat), ext3, ext4, etc.

That is just a matter of priorities then. And just because you might opt to not update in place in some situations doesn't mean that you can never do it.

I'm not sure what you mean by "Delayed allocation and data loss"; I don't find it relevant to this discussion at all since that isn't about filesystem corruption but application data corruption. And COW also suffers from this - unless you have NILFS/automatic continuous snapshots. Now with COW you probably have a much greater chance of recovering the data with forensic tools (also discussed in this thread regarding ZFS), but with huge downsides, and it's hardly a relevant argument for COW in the vast majority of use cases anyway.

ZFS minimum block size corresponds to the disk sector size, so for most practical purposes it is the same as your typical non-COW filesystem there. Writing 1 byte requires you to read 4 kB, update it in memory, recalculate the checksum, and then write it down again.

How you remove old records shouldn't depend on COW should it?

My only statement was that checksums isn't in any way dependent on COW.

The discussion about compression is invalid as it is a common feature of non-COW filesystems anyway.

Haven't seen a proper argument for the corruption claims. And that you get corrupted data if you interrupt a write is not a huge deal. Mind you, a corrupted write, not a corrupted filesystem. The data was toast anyway. A typical COW would at best save you one "block" of data, which is hardly worth celebrating anyway. Your application will not care if you wrote 557 out of 1000 blocks or 556 out of 1000 blocks; your document is trashed anyway. You need to restore from backup (or from a previous snapshot, which of course is the typical killer feature of COW).

There are also several ways to solve the corruption issue. ReFS for instance has data checksums and metadata checksums but only do copy-on-write for the metadata. (edit: was wrong about this, it uses COW for data too if data checksumming is enabled)

dm-integrity can be used at a layer below the filesystem and solves it with the journal https://www.kernel.org/doc/html/latest/admin-guide/device-ma...

Yes, COW is popular and for good reasons. As is checksumming. It isn't surprising that modern filesystems employ both. Especially since the costs of both have been becoming less and less relevant at the same time.


While filesystem-integrated RAID makes sense since the filesystem can do filesystem-specific RAID placements (eg zfs), for now the safest RAID experience seems to be filesystem on mdadm on dm-integrity on disk partition, so that the RAID and RAID errors are invisible to the filesystem.


> the safest RAID experience seems to be filesystem on mdadm on dm-integrity on disk partition, so that the RAID and RAID errors are invisible to the filesystem.

I suppose I don't understand this. Why would this be the case?


dm-integrity solves the problem of identifying which replica is good and which is bad. mdadm solves the problem of reading from the replica identified as good and fixing / reporting the replica identified as bad. The filesystem doesn't notice or care.
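A rough sketch of that stack (device names are placeholders; integritysetup ships with cryptsetup):

  # add a per-sector checksum layer to each member device
  integritysetup format /dev/sdb1
  integritysetup open /dev/sdb1 int-sdb1
  integritysetup format /dev/sdc1
  integritysetup open /dev/sdc1 int-sdc1

  # build the mirror on top of the integrity devices, then a plain fs on top
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/mapper/int-sdb1 /dev/mapper/int-sdc1
  mkfs.xfs /dev/md0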


Ahh, so you intend, "If you can't use ZFS/btrfs, use dm-integrity"?


No. I don't use ZFS since it's not licensed correctly, so I have no opinion on it. And BTRFS raid is not safe enough for use. So I'm saying "Use filesystem on mdadm on dm-integrity".


>I don't use ZFS since it's not licensed correctly,

Oh look a hobby-lawyer!! Please Linus, license your code "correctly" it's called ISC not GPL.


What makes dm-integrity "safer" than zfs or btrfs raid?


we used NILFS 15 years ago in dejaview - https://www.cs.columbia.edu/~nieh/pubs/sosp2007_dejaview.pdf

We combined nilfs + our process snapshotting tech (we tried to mainline it, but it didn't go, though many of the concepts ended up in CRIU) + our remote display + screen reading tech (i.e. normal APIs) to create an environment that could record everything you ever saw visually and textually, enable you to search it, and enable you to recreate the state as it was at that time with no noticeable interruption to the user (process downtime was like 0.02s).


Super cool work. Are any tools like this available today? I know some VM tools have snapshotting but full history and high speed scrubbing sounds awesome.


sadly (as with much work from PhD students like I was), the closest one could get to it today is trying to duplicate it.

i.e. combining CRIU with NILFS (but a lot of the work that we did to get process downtime to minimal numbers requires being in the kernel, as described in the paper), and I'm unsure CRIU can do it.

In addition our screenrecording mechanism was our own "proprietary" (not really proprietary as fully described in research papers, but also not a standard) and something that was built as an X display driver 15 years ago (so not directly usable today even if code is available). Could probably duplicate it with vnc based screencasting. vnc didn't work for us as we needed better performance (i.e. it was built to demonstrate remote display of video and games and there was no real remote audio setup back then so we had to create our own).

the "text" search just used gnome's accessible API much like a screenreader would do (with a bit of per application optimizations as can filter out things like menus and the like, primarily was to dump text out of terminals, firefox and perhaps open office and maybe even a pdf reader if memory serves me correctly, but a long time ago).


You might be able to do the screen recording today using Wayland portals, or nested display servers a la Xpra. That could make per-app recording feasible and relatively transparent.

https://xpra.org


Very cool. I could see a system like this:

1. being the layer upon which AI-assisted computing is trained and built

2. being used for video game QA. For certain heisenbugs it'd be invaluable to be able to restore point-in-time repro to resume just before it happened


the lab I was in did explore it (not work I was directly involved in, so don't have much direct insight into it, besides knowing that we were exploring how to use it for debugging).

https://www.cs.columbia.edu/~nieh/pubs/sigmetrics2010_scribe... https://www.cs.columbia.edu/~nieh/pubs/sigmetrics2011_transp...


This is cool, thanks for sharing it.


What happens if you run "dd if=/dev/zero of=/any/file/here", thus simply loading the disk with all the zeros it can handle? Do you lose all your snapshots as they are deleted to make room, or does it keep some space aside for this situation?

(Not a "gotcha" question, a legitimate question.)


It's configurable: https://nilfs.sourceforge.io/en/man5/nilfs_cleanerd.conf.5.h.... Cleanerd is responsible for maintaining a certain amount of free space on the system, and you can control the rules for doing so (e.g. a checkpoint won't be eligible for being cleaned until it is 1 week old).
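For example (values illustrative and from memory; the man page above has the exact parameter names and defaults), /etc/nilfs_cleanerd.conf lets you express policies roughly like:

  # keep every checkpoint for at least one week before it may be collected
  protection_period   604800

  # only clean aggressively when the share of free segments gets low
  min_clean_segments  10%
  max_clean_segments  20%

  # how many segments to reclaim per pass, and how often to run (seconds)
  nsegments_per_clean 2
  cleaning_interval   5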

It's also worth knowing NILFS2 has checkpoints and snapshots. What you actually get are continuous "checkpoints". These can be upgraded to snapshots at any time with a simple command. Checkpoints are garbage collected, snapshots are not (until they are downgraded back into checkpoints).
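Roughly (the checkpoint number and device are placeholders):

  # promote checkpoint 1234 to a snapshot so the cleaner leaves it alone
  chcp ss /dev/sdb1 1234

  # later, demote it back to a plain checkpoint so it can be reclaimed
  chcp cp /dev/sdb1 1234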


the garbage collector daemon will delete older checkpoints beyond the preserve time to make some room.


I know this isn't what you're getting at, but is it smart enough to create a sparse file when you specifically pick zero as your filler byte?


I remember DEC/HP releasing the source to the Digital UNIX AdvFS filesystem on SourceForge with the intent of porting it over to Linux, but it never materialized. AdvFS had many advanced features. The source is still available and within it are some PDF slides that explain a lot of its features.



I noticed 'log based recovery'. Basically ReiserFS.


NILFS is really, really cool. In concept. Unfortunately the tooling and support just isn't there. I ran it for quite some time on my laptop and the continuous snapshotting is everything I hoped it'd be. At one point however there was a change to the kernel that rendered it unbootable. Despite being a known and recorded bug it took forever to get fixed (about a year if I recall correctly), leaving me stuck on an old kernel the whole time.

This was made more frustrating by the lack of any tooling such as fsck to help me diagnose the issue. The only reason I figured out it was a bug was that I booted a live CD to try to rescue the system and it booted fine.

When I finally replaced that laptop I went back to ZFS and scripted snapshots. As much as I want to, I just can't recommend NILFS for daily use.


Do you happen to remember which change in kernel was the cause?

I had trouble with unpopular file systems as the root file system when the initrd was not built properly. So sysresccd is always good to have in reach. That said, I think I won't have any file system on root other than the distro's default. Data which require special care are on other partitions.


How did Linus not go on a rampage after breaking userspace for an entire year? Is NILFS not part of the kernel mainline, I guess?


> How did Linus not go on a rampage after breaking userspace for an entire year?

Linus' commandment about not breaking userspace is frequently misunderstood. He wants to ensure that user-space /programs/ do not break (even if they rely on buggy behavior that made it into a release), not that the /user/ will never see any breakage of the system whatsoever, which is of course an impossible goal. Device drivers and filesystems are firmly system-level stuff, bugs and backwards-incompatible changes in those areas are regrettable but happen all the same.


If I understand correctly, I don't think this is a userspace-breaking bug, as in: a kernel API changed and made a userspace program not work anymore.

It is a bug that prevents the kernel from booting. That's bad, but that's not the same thing. That's not a userspace compatibility issue such as the ones Linus chases. The user space isn't even involved if the kernel cannot boot. Or if it is actually a userspace program that causes a kernel crash, it is a crash, which is not really the same thing as an API change (one could argue, but that's a bit far-fetched, the intents are not the same, etc - I don't see Linus explode on somebody who introduced a crash the way he would explode on someone changing a userspace API).


> Is NILFS not part of the kernel mainline, I guess?

Good guess, but no:

https://github.com/torvalds/linux/tree/master/fs/nilfs2

> How did Linus not go on a rampage after breaking userspace for an entire year?

I would very much like to know that as well. Any chance it didn't get reported (at least, not as "this broke booting")?


I reported it along with a few other users in https://marc.info/?l=linux-nilfs&m=157540765215806&w=2. I think it just isn't widely enough used that Linus noticed we were broken. If I recall correctly it also wasn't directly fixed so much as incidentally. I just kept checking new kernel versions as they were released until one worked. There was never anything in the change-log (that I recall) about fixing the bug, just another change that happened to fix the issue.

Edit: Looking through the archives, it looks like my memory was somewhat uncharitable. It was reported in November and directly patched in June (https://marc.info/?l=linux-nilfs&m=159154670627428&w=2) so about 7 months after reporting. Not sure what kernel release that would've landed in, so could've been closer to 8.


How does this compare to ZFS + cron to create snapshots every X minutes?
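(For reference, the cron approach is just something like the following, with the pool/dataset name as a placeholder:)

  # /etc/cron.d/zfs-autosnap: snapshot every 5 minutes, named by timestamp
  */5 * * * * root /sbin/zfs snapshot tank/home@auto-$(date +\%Y\%m\%d-\%H\%M)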


A week ago my client lost data on ZFS by accidentally deleting a folder. Unfortunately the data was created and deleted in the meantime between two snapshots. One would expect that it still might be possible to recover, because ZFS is CoW.

There are some solutions like photorec (which now has ZFS support), but it expects you can identify the file by footprint of its contents, which was not the case. Also many of these solutions would require ZFS to go offline for forensic analysis and that was also not possible because lots of other clients were using the same pool at the time.

So this failed me, and I really wished at the time that ZFS had continuous snapshots.

BTW, on ZFS I use ZnapZend. It's the second best thing after continuous snapshots:

https://www.znapzend.org/ https://github.com/oetiker/znapzend/

There are also some ZFS snapshotting daemons in Debian, but this is much more elegant and flexible.

But since ZnapZend is a userspace daemon (as are all ZFS snapshotters) you need some kind of monitoring and warning mechanism for cases where something goes wrong and it can no longer create snapshots (crashes, gets killed by the OOM killer or something...). In NILFS2 every write/delete is a snapshot, so you are basically guaranteed by the kernel to have everything snapshotted without having to watch it.


There is no comparison. NILFS provides *continuous* snapshots, so you can inspect and roll back changes as needed.

It does without a performance penalty compared to other logging filesystems.

And without using additional space forever. The backlog rotates forward continuously.

It's a really unique feature that makes a lot of sense for desktop use, where you might want to recover files that were created and deleted after a short time.


Perhaps we could leverage the "inotify" API to make ZFS snapshot every time some file has been changed... But I think ZFS is not really good at handling huge amounts of snapshots. The NILFS2 snapshots are probably more lightweight than the ZFS ones.


> Perhaps we can leverage "inotify" API to make ZFS snapshot everytime some file had been changed...

ZFS and btrfs users are already living in the future:

  inotifywait -r -m --format %w%f -e close_write "/srv/downloads/" | while read -r line; do
      # command below will snapshot the dataset
      # upon which the closed file is located
      sudo httm --snap "$line" 
  done
See: https://kimono-koans.github.io/inotifywait/


What is httm? I like this script as a proof of concept.

But I can still imagine failure modes, e.g. inotify might start acting weird when ZFS remounts the watched directory, the OOM killer might terminate it without anyone noticing, or the bash loop might go haywire when the package manager updates that script (bash runs directly from the file, and when it changes during execution, it might just continue running from the same byte offset in a completely different script).

All these things actually happened to me in the past. Not to mention that if you have multiple datasets in ZFS you cannot inotifywait on all of them at once, so you will have to manage one bash process per dataset. And the performance of bash and sudo might not be that awesome.

So for real reliability you would probably want this to actually run in ZFS/kernel context...


> What is httm? I like this script as a proof of concept.

See: https://github.com/kimono-koans/httm

> But i still can imagine failure modes, eg. inotify might start acting weird when ZFS remounts the watched directory, OOM killer terminates it without anyone noticing, bash loop go haywire when package manager updates that script (bash is running directly from the file and when it changes during execution, it might just continue running from the same byt offset in completely different script).

I mean, sure, scripts gonna script. You're gonna have to make the POC work for you. But, for instance, I'm not sure half of your issues are problems with a systemd service. I'm not sure even one is a problem with a well-designed script, which accounts for your particular issues, plus a systemd service.
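A minimal sketch of running the watcher under systemd (the unit name, script path and dependency are made up for illustration):

  # /etc/systemd/system/httm-snapwatch.service
  [Unit]
  Description=Snapshot ZFS dataset on file close via inotifywait + httm
  After=zfs-mount.service

  [Service]
  ExecStart=/usr/local/bin/snapwatch.sh
  Restart=always
  RestartSec=5

  [Install]
  WantedBy=multi-user.target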

> All these things actualy happened to me in the past. Not to say that if you have multiple datasets in ZFS you cannot inotify wait on all of them at once, so you will have to manage one bash process per dataset. And performance of bash and sudo might not be that awesome.

Yes, you can?

Just re this POC, you can inotifywait a single directory, which contains multiple datasets, and httm will correctly determine and snapshot the correct one upon command. Your real bottleneck here is not sudo or bash. It's the zfs command waiting for a transaction group sync, or arranging for the trans group (or even something else, but it's definitely zfs?), to snap.

You can also use `httm -m` to simply identify the dataset and use a channel program and/or a separate script to sync. sudo and bash may not have the performance for your use case, hell, they are composable with everything else?

> So for real reliability you would probably want this to actualy run in ZFS/kernel context...

Yeesh, I'm not sure? Maybe for your/a few specific use cases? Note, inotify (a kernel facility) is your other bottleneck. You're never going to want to watch more than a few tens of thousands of files. The overhead is just going to be too great.

But for most use cases (your documents folder)? Give httm and inotifywait a shot.


The NILFS snapshots are practically free (for a logging filesystem, obviously).


> It does without a performance penalty.

What is the basis for comparison? Sounds like a pretty meaningless statement at its face.


Compared to other logging filesystems obviously.


Nilfs baseline (write throughput especially) is slow as shit compared to other filesystems including f2fs. So just because you have this feature that doesn’t make it even slower isn’t that interesting - you pay for it one way or the other.


For many users filesystem speed of your home directory is completely irrelevant unless you run on a Raspberry Pi using SD cards. You just don't notice it.

Of course if you have a server handling, let's say, video files, things will be very different. And there are some users who process huge amounts of data.

I've run 2 LVM snapshots (daily and weekly) on my home partition for years. Write performance is abysmal if you measure it, but you don't notice it in daily development work.


> There is no comparison.

What if I compare it to BTRFS + Snapper? No performance penalty there, plus checksumming.


btrfs and snapperd do have a performance penalty as the number of snapshots increases. Having 100+ usually means snapper list will take north of an hour. You can easily reach these numbers if you are taking a snapshot every handful of minutes.

Even background snapper cleanups will start to take a toll, since even if they are done with ionice they tend to block simultaneous accesses to the filesystem while they are in progress. If you have your root on the same filesystem, it's not pretty -- lots of periodic system-wide freezes with the HDD LEDs non-stop blinking. I tend to limit snapshots always to < 20 for that reason (and so does the default snapperd config).


About 2 years ago I believed the same. Then I used BTRFS as a store for VM images (with periodic snapshots) and performance went down really, really badly. After I deleted all snapshots performance was good again. There is a big performance penalty in btrfs with more than about 100 snapshots.


Did you disable CoW on the VM image files? This makes a significant difference to performance on BTRFS


>It's a really unique feature that makes a lot of sense for desktop use

Sounds like it could serve as a basis for a Linux implementation of something like Apple Time Machine.


With 'httm', a few of us are already living in that bright future: https://github.com/kimono-koans/httm


Afaik Time Machine does not do continuous snapshots, just periodic (and triggered).

So you can already do that with zfs: take a snapshot and send it to the backup drive.
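A minimal sketch of that, assuming a pool named tank and a backup pool named backup on the external drive:

  # one-off full send of a snapshot to the backup pool
  zfs snapshot tank/home@2022-10-11
  zfs send tank/home@2022-10-11 | zfs receive backup/home

  # later, send only what changed since the last common snapshot
  zfs snapshot tank/home@2022-10-12
  zfs send -i tank/home@2022-10-11 tank/home@2022-10-12 | zfs receive backup/home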


"It does without a performance penalty"

yeah. it's already so terribly slow that it's unlikely that taking snapshots can make it any slower :-D


That was not my experience with NILFS. It outperformed ext4 on my laptop NVMe.


The benchmarks here look pretty bad:

https://www.phoronix.com/review/linux-58-filesystems/4


The last page looks pretty bad. If you look at the others it's more of a mixed bag, but yeah.

I don't remember what benchmark I ran before deciding to run it on my laptop. Given my work at the time probably pgbench, but I couldn't say for sure. It was long enough ago I also might've been benchmarking against ext3, not 4.


I think I was running it on a 6TB conventional HDD RAID1. Also note that the read and write speeds might be quite asymmetrical... in general it also depends on the workload type.


I run this setup. zfs + zfsnap (not cron anymore, now systemd.timer).

I cannot tell if NILFS does this too; with zfsnap I maintain different retention times: every 5 minutes for 1 hour, hourly for 1 day, daily for a week. That's fewer than 60 snapshots. The older ones are cleaned up.

In addition, zfs brings compression and encryption. That's why I have it on the laptops, too.


I've been running NILFS2 on my main work NAS for 8 years. It never failed us :)


I mean this honestly: how did you evaluate such a new filesystem in order to bet a work NAS upon it?


I did some testing, and installed it on a secondary system that in the beginning mostly hosted unimportant files. Then we added more things, and as after a few years it posed absolutely no problems, we went further (and added a backup procedure). Then we migrated to new hardware, and it's still going strong (it's quite a small volume, about 15 TB).


I would do it by using it! ... and probably some backup


Do any file systems have good, native support for tagging and complex searched based on those tags?


BeFS was the last real one I'm aware of at the complexity you are talking about (plenty of FSen have some very basic indexed support for, say, file sizes, but not the kind of generic tagging you are talking about).

At this point, the view seems to be "attributes happen in the file system, indexing happens in user space".

Especially on linux.

Part of the reason is, as I understand it, the surface/complexity of including query languages in the kernel, which is not horribly unreasonable.

So all the common FSen have reasonable xattr support, and inotify/etc that support notification of attribute changes.

The expectation seems to be that the fact that inotify might drop events now and then is not a dealbreaker. The modern queue length is usually 16384 anyway.

I'm not saying there aren't tradeoffs here, but this seems to be the direction taken overall.

I actually would love to have an FS with native indexed xattr and a way to get at them.

I just don't think we'll get back there again anytime soon.


Okay - how about tagging and non-complex searches then. Beggars can't be choosers :-)

Really what I'd like is just to search for some specific tags, or maybe list a directory excluding some tag, or similar. For bonus points, maybe a virtual directory that represents a search like this, and which "contains" the results of that search. (A "Search Folder")

I'll check out BeFS. Thanks!


How close is this to a large continuous tape loop for video surveillance?

I would very much welcome a filesystem that breaks away from the directories/files paradigm. Any time-based data store would greatly benefit from that.


I think all you would need to add is a daemon that automatically deletes the oldest file(s) whenever free space drops below a certain threshold, so that the filesystem GC can reclaim that space for new files.
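Something along these lines, as a very rough sketch (the threshold and path are made up):

  #!/bin/sh
  # delete the oldest file in /srv/recordings while the fs is more than 90% full
  while :; do
      use=$(df --output=pcent /srv/recordings | tail -1 | tr -dc '0-9')
      if [ "$use" -gt 90 ]; then
          # oldest file by modification time
          oldest=$(find /srv/recordings -type f -printf '%T@ %p\n' | sort -n | head -1 | cut -d' ' -f2-)
          [ -n "$oldest" ] && rm -f "$oldest"
      else
          sleep 60
      fi
  done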


I know and use 'logrotate'.

My point was more along the lines of a filesystem where a single file can be overwritten over and over again, and it's up to the filesystem to transparently ensure the full capacity of the disk is put towards retaining old versions of the file.



I definitely need to dive into Ceph, thanks for the pointer :-)


If NILFS is continuously checkpointing, couldn't you even remove the file right after you add it, for simplicity?


I've always wondered why NILFS (or similar) isn't used for cases where ransomware is a risk. I'm honestly surprised that it's not mandated to use an append-only / log-structured filesystem for some critical systems (think patient records), where the cost of losing data is so high, the data is rarely mutated, and trading that off against wasted storage isn't that bad (after all, HDD storage is incredibly cheap, and nobody said you had to keep the working set and the log on the same device).


you don't need a log structured fs to do this, you could just have regular zfs/btrfs snapshots too.

BUT

if an attacker has the ability to delete an entire file system / encrypt it, they really have the ability to delete the snapshots as well; the only reason they might not is "security through obscurity".

now, what I have argued is that an append-only file system which works in a SAN-like environment (i.e. you have random reads, but append-only write properties that are enforced remotely) could give you that, but to an extent you'd still get similar behavior by just exporting ZFS shares (or even block devices) and snapshotting them regularly on the remote end.


> if an attacker has the ability to delete an entire file system / encrypt it, they really have the ability to delete the snapshots as well, ..

How so?

Let's say you have one machine holding the actual data for working on it, and some backup server. You could use btrfs send over ssh and regularly btrfs receive the data on the backup machine. Even if they got encrypted by ransomware they wouldn't be lost in the backups. As long as they're not deleted there, how could a compromised work machine compromise the data on the backup machine?
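A minimal sketch of that, assuming /data is a subvolume and "backup" is the remote host (paths and names are placeholders):

  # on the work machine: create a read-only snapshot and ship it
  btrfs subvolume snapshot -r /data /data/.snap-2022-10-11
  btrfs send /data/.snap-2022-10-11 | ssh backup "btrfs receive /backups/work"

  # next time, send only the delta relative to the previous snapshot
  btrfs subvolume snapshot -r /data /data/.snap-2022-10-12
  btrfs send -p /data/.snap-2022-10-11 /data/.snap-2022-10-12 | ssh backup "btrfs receive /backups/work"

The receiving side only ever gains new read-only subvolumes; deleting old ones has to happen on the backup machine itself.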


I'm not anywhere near a user of the btrfs send/receive API, but I was under the impression that it essentially streams commands and hence could conceptually also be used to delete snapshots on the remote side?


I don't know about the API but as far as I understand the man page, there doesn't appear to be a way to delete a subvolume on the remote side by doing btrfs receive.

That btrfs receive may well be exploitable (because of some bug) if you feed it with malicious data is IMO a different topic.


so I'm looking at Oracle's docs

https://docs.oracle.com/cd/E37670_01/E37355/html/ol_srbackup...

while it doesn't allow you to delete a subvolume (i.e. what a snapshot is built off of), it does seem like one could conceptually just overwrite it via send/receive? perhaps it's a bit smarter and doesn't let snapshots be modified, but I'm now a bit unconvinced that it can be viewed as simply an "append only" interface.


> ... it does seem like one could conceptually just overwrite it via send/receive?

Don't think so.

The man page for btrfs receive states:

> btrfs receive will fail in the following cases:

> 1. receiving subvolume already exists

I also just tried it out. I created some subvol1-ro in a source filesystem, then sent it to the destination filesystem. Then deleted the source subvol1-ro. Then created a new subvol1-ro containing a different file and tried to send it to the same destination.

Result:

> ERROR: creating subvolume subvol1-ro failed: File exists


so, if I use a subvolume (not as a snapshot, but as something I can modify) I can't use send/receive on it? it seems a little weird. I'd assume I should be able to send modifications to the remote end (and my argument is, if one can send modifications, one should be able to send modifications that effectively delete all content). I temper this argument with the fact that, even if true, perhaps one cannot do this to subvolumes marked as snapshots.


AFAIK only read only snapshots can be sent and received. It kind of makes sense, since a writable snapshot could be written to during the send/receive so that the remote result would not match any clearly defined local state.

A received ro snapshot can be used as a base for a new writable snapshot.


ok, you are probably right, glanced at send.h and it seems it only allows a limited set of "btrfs" fs operations.

that would fit with my description (at least in a sense): it enables an "append"-only mode on the remote side (where "append" means you can't mess with previous snapshots remotely).


What happens when you store a virtual machine hard disk image on this? When you boot the VM, is the entire image duplicated via CoW every time something changes in the VM?


Didn't VMS have this baked in? My memory is that all 8.3 file names had 8.3[;nnn] version tagging under the hood


That's what it looked like, but I doubt it was deep in the filesystem. It was basically just a naming convention. User had to purge old versions manually. This gets tedious if you have many files that change often. Snapshots are a safety net, not something you want to have in your way all day long.


Er.. my memory is that it did COW inside VMS fs semantics and was not manually achieved. You did have to manually delete. So I don't think it was just a hack.

It didn't do directories, so it was certainly not as good as snapshots, but we're talking 40 years ago!


https://en.wikipedia.org/wiki/OpenVMS#File_system

>> DEC attempted to replace it with a log-structured file system file system named Spiralog first released in 1995. However, Spiralog was discontinued due to a variety of problems, including issues with handling full volumes.


The link you want is https://en.m.wikipedia.org/wiki/Files-11

Every file has a version number, which defaults to 1 if no other versions of the same filename are present (otherwise one higher than the greatest version). Every time a file is saved, rather than overwriting the existing version, a new file with the same name but an incremented version number is created. Old versions can be deleted explicitly, with the DELETE or the PURGE command, or optionally, older versions of a file can be deleted automatically when the file's version limit is reached (set by SET FILE/VERSION_LIMIT). Old versions are thus not overwritten, but are kept on disk and may be retrieved at any time. The architectural limit on version numbers is 32767. The versioning behavior is easily overridden if it is unwanted. In particular, files which are directly updated, such as databases, do not create new versions unless explicitly programmed.


How is this pronounced? Nil-F-S? Nilfuss? Nai-L-F-S? N-I-L-F-S?


The first one.


Very nice introduction to NILFS, which has been in the Linux kernel since 2009.


What's the difference between a snapshot, and a checkpoint?


from TFA:

> A checkpoint is a snapshot of your system at a given point in time, but it can be deleted automatically if some disk space must be reclaimed. A checkpoint can be transformed into a snapshot that will never be removed.



