Show HN: Filesystem Watcher (github.com/e-dant)
91 points by e-dant on Oct 18, 2022 | 73 comments
An arbitrary filesystem event watcher which is:

- simple

- efficient

- dependency free

- runnable anywhere with a filesystem

- header only

Watcher is extremely efficient. In most cases, even when scanning millions of paths, this library uses a near-zero amount of resources.

Watcher is simple. The library exposes a single function and a single object. That is all.

Happy hacking.




I have personally written a similar tool, and I am very curious how this could use a near-zero amount of resources while maintaining accuracy. As far as I know, there are two ways to implement this functionality:

1) Store an in-memory representation of the file system, periodically refresh it by polling the paths under watch, and emit events when differences are detected.

2) Hook into the underlying kernel event APIs (kqueue, inotify, FSEvents, ReadDirectoryChangesW, etc.) and report their events.

Option 1 uses a lot of CPU and memory (the map storing the paths being monitored could easily grow to be tens or even hundreds of megabytes if many files are being monitored, which is often the case in large source projects). I have seen tools that use polling with a 100ms interval continuously burn 50% of cpu monitoring a modest sized directory with tens of thousands of files.

Option 2 theoretically would use less memory and little to no cpu, but in practice, the story is more complicated. If you are using an inotify or kqueue like api, you will have to store handles for all of the paths that are being monitored, which can take a significant amount of memory. On macos, the file system events are not accurate in the sense that you can't trust the type of event. It doesn't reliably distinguish between creation and modification events. So if you want to know specifically what kind of event happened, you end up back in case 1 where you have to store an in memory representation of the file system and diff against the in memory representation and the current file system state when you detect an event. For some use cases, you may not care to distinguish between creations and modifications and can get away with a lower memory, but less accurate, solution.

In my experience, getting all of this right is much more difficult than it appears at first glance. Good luck to you.


More technically, here’s what we have:

A “baseline” filesystem watcher which uses only the standard library. It has been made to beat kqueue. And it does.

A platform filesystem watcher for Darwin is used, but certain event properties are handled by the standard library. Namely, the event time and the path type.

A platform filesystem watcher is scheduled for Windows. Work hasn’t been started.

A platform filesystem watcher for Linux (> 2.4 or so) was toyed with but ultimately rejected out of accuracy concerns. It was far more efficient than the cross-platform implementation “warthog”, no doubt, but it lacked accuracy. Work is being done to get most of the benefits from both worlds.

There are problems with the “baseline” watcher (which I’ve named “warthog” because it’s sturdy and reliable). But those are potential efficiency losses when watching more than a few million paths. They are, thankfully, not accuracy or safety problems.

Maybe you can see the solution emerging here?

Here’s where we’re going next:

The most efficient kernel watchers can be used on most platforms, but checked for their accuracy periodically by the “warthog” watcher.


What do you mean by beat kqueue? Is it faster than kqueue? Does it use less memory than kqueue?

How does the baseline filesystem watcher work? If it doesn't use kqueue, does it poll the filesystem periodically and diff against an in memory representation? If yes, see my other comments. If not, I am genuinely curious what you are doing because you know something that I do not.


When I began this project, I started with kqueue. The performance was wanting and there were bugs with very large file trees.

I moved to a minimal std::filesystem-based watcher and optimized it from there.

There hasn’t been a formal head-to-head test between the two. That should be about halfway down my todo list. It’s worth revisiting more formally.

My response to this question should help here: https://news.ycombinator.com/item?id=33247155#33251437

In short, there’s no secret sauce. There’s an efficiency spread in (what I consider) edge-cases.

Every potential gain over other naive watchers implemented with kqueue is likely algorithmic. I store events in a historical map, compare differences to the current state of the file tree, prune them, and send events when they change. That’s the whole implementation: scan paths, record their attributes, check for differences in the map, and send events when they happen. I haven’t given much thought to exactly why it beats kqueue, nor are there any good tests showing by how much. (Again, this is worth doing.)


Makes sense. I have only used kqueue on macos to monitor a small number of files and I find it quite painful to use and the semantics were confusing, not sure if it is different on say freebsd.

Just as a heads up, one of the strange fsevents issues is that it fails if you register two directories where one directory is a prefix of the other. So say that you want to monitor directories $ROOT/foo and $ROOT/fo and you register an event stream first with $ROOT/foo and then $ROOT/fo, you will only receive events for paths in $ROOT/fo and no events for paths in $ROOT/foo (I just double checked that this is still the case in Monterey at least). I never bothered to report this to apple but worked around it by just registering a stream with $ROOT if I detected that one path name was a substring of another.
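A minimal sketch of that workaround, as I understand it from the description above: before registering anything, collapse any two paths where one string is a prefix of the other. It makes no FSEvents calls. The function name `roots_to_register` is hypothetical, and the code assumes normalized absolute POSIX-style paths without trailing slashes.

```cpp
#include <cstddef>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Returns the directories to register so that no returned path is a string
// prefix of another. Assumes normalized absolute paths, no trailing slash.
std::vector<std::string> roots_to_register(const std::vector<std::string>& wanted) {
  std::vector<std::string> roots;
  for (std::string candidate : wanted) {
    bool changed = true;
    while (changed) {
      changed = false;
      for (std::size_t i = 0; i < roots.size(); ++i) {
        const bool root_is_shorter = roots[i].size() <= candidate.size();
        const std::string& shorter = root_is_shorter ? roots[i] : candidate;
        const std::string& longer = root_is_shorter ? candidate : roots[i];
        if (longer.compare(0, shorter.size(), shorter) != 0) continue;  // unrelated
        // Same path or a true ancestor: the shorter path already covers both.
        // Siblings such as fo and foo: fall back to their parent directory.
        const bool covers = longer.size() == shorter.size() ||
                            longer[shorter.size()] == '/';
        std::string merged =
            covers ? shorter : fs::path(shorter).parent_path().string();
        roots.erase(roots.begin() + static_cast<std::ptrdiff_t>(i));
        candidate = std::move(merged);
        changed = true;  // re-check the merged path against the remaining roots
        break;
      }
    }
    roots.push_back(candidate);
  }
  return roots;
}
```

With `{"/r/foo", "/r/fo"}` as input this yields a single root, `/r`. The price is that events for unrelated paths under `/r` also arrive and have to be filtered out by the caller.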


Have you tried using auditpipe?


> hook into the underlying kernel events like kqueue...

I'm really surprised that this sort of functionality isn't built into OS's/filesystems. I recently had to do this for HDFS, and I finally "gave up" and polled the file system like you suggest as your first option. Event notification seems like something that ought to be a fundamental feature and is best owned by the file system itself.


> I'm really surprised that this sort of functionality isn't built into OS's/filesystems

It appears to be built into macOS [1]?

> Whenever the filesystem is changed, the kernel passes notifications via the special device file /dev/fsevents to a userspace process called fseventsd

Which I assume is what they're referring to here:

> A platform filesystem watcher for Darwin is used, but certain event properties are handled by the standard library. Namely, the event time and the path type.

1. https://en.wikipedia.org/wiki/FSEvents


> It appears to be built into macOS [1]?

It is, but it's badly implemented and buggy. But the real problem is that there is no POSIX-like specification for file system events, so every platform does it differently. Even if every platform implementation were perfect and bug free, it is a huge pain to write wrappers for each one.


Completely agree. That is why having built a tool similar to this one, I'm not even linking to it. The complexity involved in working around the OS limitations is maddening and convinced me that it would be better to think of a different approach to writing software that wouldn't require monitoring files to achieve the fast feedback loop that these tools are designed to facilitate.

The magic file approach described by kevincox below is probably the best way to get > 95% of the benefit with < 1% of the work.


It’s difficult to get it perfectly right.

There is ongoing work attempting to make it more perfect.

I expect a year or two before this is complete.

For now though, it does do what it says. The tests I’ve run show that it is accurate over large amounts of events and time. For under 1 million files and/or directories, it uses a near-zero amount of resources. Testing on older processors shows similarly positive results.

But this is so far from perfect. This is only the groundwork. Most of the bugs have yet to be discovered. The platform support, more often than not, uses the safe “baseline” watcher in favor of accuracy.

Ned14 of Boost fame has given the project some expert advice which will help it along smoothly.


What do you mean near-zero? You said that inotify doesn't work (and ned14 offers his comments about it). If you are using polling, I do not understand how your approach could be using a near-zero amount of resources. Let's say you are monitoring a directory with 1 million files, how can you store the state in less than 20MB of memory (which is about the most optimistic lower bound that I can think of)? What is your secret sauce? Do you mean there is no overhead beyond the baseline watcher? But what about the overhead of the baseline watcher itself?

For what it's worth, in spite of ned14's comments, I have never seen inotify fail in practice (except for if it hits the os file descriptor limits in which case it does fail noisily). The tool I wrote uses inotify for linux. It is used by thousands of developers every day as part of an editor integration and there are no open issues about dropped file events.

Your time frame is probably about right. It took me about a year to work through all the edge cases.


Near-zero is a bit loose. It keeps a relatively compact in-memory representation. You’re about right with your estimate. Having measured just now, it’s about 30mb for 1 million directories.

The baseline Watcher’s efficiency has a wide spread. When there are many thousands of nested subdirectories, the CPU approaches the limit of the thread it’s on. Flatter directories, or many files without nested subdirectories, do not have nearly as much of an effect. I’ve seen it run on around 10 million paths on a very flat test directory.

So, near-zero is somewhat misleading. There’s a wide spread in efficiency. It was my judgement that deeply nested directory trees were far less common in practice, so I wrote “near-zero” in the optimistic case.

It uses polling under the hood (at least, I’m sure it does. It uses whatever std::filesystem uses, which is almost certainly polling).


"Watcher is extremely efficient. In most cases, even when scanning millions of paths, this library uses a near-zero amount of resources." Yea, maybe or maybe not and my first guess is maybe not.

This needs at least some bullet points on HOW it does this so efficiently so that I'll keep looking. A blanket statement like this means "they hope it is efficient" or "They want it to be efficient" or "It's good in some scenarios but not others".

With those additional bits, I have a reason to dig around the source.


Is this just using inotify on Linux?

If so, there are equivalent options, including systemd path units, incron, and the inotifywait utility, in addition to the C API.

The "man systemd.path" page does list explicit limitations of this kernel system call:

"Internally, path units use the inotify(7) API to monitor file systems. Due to that, it suffers by the same limitations as inotify, and for example cannot be used to monitor files or directories changed by other machines on remote NFS file systems." (Files modified by mmap() also don't trigger events.)

https://www.linuxjournal.com/content/linux-filesystem-events...

Windows busybox also has an inotifyd, which appears to do something similar.


You are right. I’ll make sure to give a deeper breakdown in the readme.


Looking at Win32, it scans the whole directory periodically, right? I must be missing something, but how can that be called efficient?


I don't know about efficient, but at least it sounds reliable. I'm at my wit's end with trying to figure out why KDE's Dolphin can't reliably watch a directory for new files, frequently (but not always) forcing me to F5 to see new files.


It’s efficient because it beats kqueue while reporting events accurately.

A proper benchmarking program is in the works, however manual testing does show only minimal resource usage.

For more, see this issue: https://github.com/e-dant/watcher/issues/10


I only mentioned Win32. There this library is very inefficient compared to ReadDirectoryChangesW, which consumes no CPU time when nothing changes.



Wait, it doesn't hook the OS file handling routines? It actually manually rescans the filesystem?


It does one or the other. There are concerns about OS filesystem event hooks.

The current solution isn’t ideal, and is being addressed here: https://github.com/e-dant/watcher/issues/10


not on windows (only platform I checked/care about)


Looks like it does exactly what you can get out of Everything (https://www.voidtools.com) Index Journal

https://www.voidtools.com/forum/viewtopic.php?t=9792

but programmatically, and it's highly scriptable, pretty cool. Will definitely add it to my arsenal of troubleshooting tools.

Edit: never mind. This tool is manually scanning the filesystem instead of listening to OS events https://github.com/e-dant/watcher/blob/989147b183ee0547d71a1...


For clarity, the author did say in another comment that for Windows they plan to implement system API calls to watch files instead of manually scanning the filesystem. For macOS and linux it is listening to OS events.


Everything uses https://en.wikipedia.org/wiki/USN_Journal for fast NTFS monitoring.



Voidtools gives security warnings in my browser, fwiw.


Of course it does, it's competing with Microsoft by providing actually working instant local search.


My chrome browser.


> instant local search

still checks out :) But I checked just to be sure and no warning in Version 106.0.5249.91 (Official Build) (32-bit). Maybe it's your corporate baby content web filter?


No, it's a certificate issue; probably minor, but a server-side thing to attend to by the looks of it.


There are no certificate issues in my Chrome. Are you sure you're not on some MITMing VPN?


(late update for anybody reading this): looks like you were correct, (apologies to voidtools). When I click through the (local) corporate content blocker kicks in.


Any reason for making delay_ms a template parameter? A compiler should be able to optimize passing a constant as a regular function argument. And if it’s not optimized I assume a variable delay wouldn’t affect much?
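To make the question concrete, here are the two shapes being compared. The library's real signature is not shown in this thread, so `wait_fixed` and `wait_for` are hypothetical stand-ins for wherever the delay is consumed.

```cpp
#include <chrono>
#include <thread>

// Compile-time delay: each distinct value instantiates a separate function,
// and the value must be known when the program is built.
template <int delay_ms>
void wait_fixed() {
  std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
}

// Run-time delay: a single function, and the value can come from a
// configuration file or a command-line flag.
void wait_for(std::chrono::milliseconds delay) {
  std::this_thread::sleep_for(delay);
}
```

A caller writes `wait_fixed<16>()` or `wait_for(std::chrono::milliseconds(16))`. Since the delay is only handed to a sleep call, it is hard to see what the template form would buy beyond a compile-time constant.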


No perfectly good reason. I will look into that before version 1.


In the early days of Linux, there was a tool that saved file info to a floppy. Then you would write protect the floppy and leave it in a drive, and the tool would periodically compare OS files with that to detect alteration. I can't for the life of me remember the name, though. It was great for hardened systems.



When I last tried to implement this, by far the toughest part was making sure the file that’s been newly detected is done being written to. On ntfs I couldn’t find a good technique, even last modified time was not reliable. I had to watch it for changes myself.


I've done this by watching the NTFS journal which is surprisingly efficient. First I scanned the whole journal for filesystem metadata and dumped it into a SQLite database (which took about a minute), then kept it up to date which took virtually no resources. This was an absurdly faster way to search by file name, a search across the whole FS came back in milliseconds instead of Explorer's multiple minutes.


Is "last modified" the time of the beginning of the write?


Never even thought of that; I don’t know. I assumed it was when a write was done. Whatever that means, I don’t know either.


Would this mean that fs event based antivirus scanners could be side-stepped by writing a payload to a file and then never closing the handle?


Good point here, I will definitely check this and report back. :)


How does this work under the covers on Linux? Is it using eBPF, or is it simply an abstraction over inotify?

I'm particularly interested in something like this, but which will include information about what process made the change, and which user it was running as at the time.


I’ve gone back and forth with inotify on Linux. Ned14 gave a great rundown of the ideal next steps for a best-possible implementation.

You can check out issue/10 for a full description of how it works now, why neither inotify nor our current solution is ideal, and where the project will be going next.


I'm sorry, I don't know what issue/10 means? Is it an e-zine or something? (apologies if this is obvious, I'm extremely tired!)



Ah, sorry, that really should have been obvious :D

Have you looked into using eBPF for tracking file system changes at any point? (I don't mean for this project, as it's clear you're taking a particular approach that will work across platforms).


>or is it simply an abstraction over inotify?

Looks like it:

https://github.com/e-dant/watcher/blob/989147b183ee0547d71a1...


`scan_directory` in the same file recursively iterates directories and no calls to `inotify_*` functions seem to be made; no grep matches in the project directory.



Is there a well-tested, reliable, flexible and good working tool for windows that does the same and can be installed as a service? Just watch a directory and do something that I can configure easily with a textfile?


Well, I am not too lazy to search but I was interested in your experience, especially with reliability.

This one looks interesting: https://github.com/emcrisostomo/fswatch


How does this compare to something like famd(8) and if it’s a marked improvement, could the techniques used here be backported to FAM?


Looks great! I'm wondering what operating systems are supported. I'm assuming Linux. What about macOS and Windows?


https://github.com/e-dant/watcher/blob/989147b183ee0547d71a1...

Looks like it works on quite a few systems including Android and iOS.


All are supported.

Although, to be more efficient, I need to write system API calls for Windows.

That will be the 1.0 release.


[flagged]


They were responding to a comment asking about Linux, MacOS, and Windows support.

Needless pedantry is one thing, but deliberately misconstruing a discussion to support said pedantry is sad.

https://news.ycombinator.com/newsguidelines.html


It uses FSEvents on the Mac. It can do dumb polling too.


forgot "written in C++" .. unless you're trying to get views, better to leave that part out.


What do you mean?


Presumably a joke that Rust projects get all the love.


Does it keep working when files get overwritten by moving another over it, like some Linux text editors do?


I should test this. I haven’t seen a problem with that so far in my personal usage, so I’m inclined to say probably.

That’s a good test case. I’ll make an issue.


This was a case that tripped me up when using Qt's QFileSystemWatcher. Many apps implement "atomic updates" to files by writing out to a new file (e.g. foo.txt.temp), then moving that file on top of the old one. As far as QFileSystemWatcher was concerned, the old file was deleted and its job is done (just because the new one has the same name doesn't mean it's the same file). So you have to watch the parent directory and manually implement checking for a specific file by name when the parent directory's contents changed.
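The "atomic update" pattern and its effect on file identity can be shown in a few lines. This is a POSIX-only sketch that involves neither Qt nor Watcher; `atomic_save` and `inode_of` are hypothetical helper names, and the `.temp` suffix simply mirrors the example in the comment above.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

#include <sys/stat.h>  // POSIX only

namespace fs = std::filesystem;

// Save by writing a sibling temp file and renaming it over the target.
void atomic_save(const fs::path& target, const std::string& contents) {
  fs::path temp = target;
  temp += ".temp";
  {
    std::ofstream out(temp, std::ios::binary | std::ios::trunc);
    out << contents;
  }  // the stream is closed here, before the rename
  fs::rename(temp, target);  // on POSIX this replaces an existing target
}

// Inode number of the file currently behind `p`, or 0 if stat() fails.
unsigned long long inode_of(const fs::path& p) {
  struct stat st {};
  return ::stat(p.c_str(), &st) == 0
             ? static_cast<unsigned long long>(st.st_ino)
             : 0ULL;
}
```

Calling `inode_of` before and after a second `atomic_save` on the same path gives two different numbers: the name is unchanged, but it now points at a different file. A watcher attached to the old file sees a deletion, while one that looks the name up in the parent directory on each change sees the replacement.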


Should this work over NFS or SMB?


See also:

sane - for node

watchexec - rust based, static binary


For CLI usage I found that the best option was instead of watching my source directory just to watch a magic file. Then I configured my editor to touch that file when saving. This has a few benefits:

1. No need to worry about which files to watch or ignoring build outputs.

2. Works with every project with no setup.

3. Easy to trigger a re-run without actually changing a file.

4. Always runs after all files are saved instead of starting after the first file is saved and racing the rest.

5. Infinitely scalable.
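A rough sketch of the magic-file idea, with the editor hook and the watching side reduced to two functions. The names `touch` and `consume_touch` are hypothetical, and polling a single modification time is just one simple way to notice the touch.

```cpp
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// What the editor's save hook would do: create the trigger file if needed,
// then bump its modification time.
void touch(const fs::path& trigger) {
  if (!fs::exists(trigger)) {
    std::ofstream create(trigger);
  }
  fs::last_write_time(trigger, fs::file_time_type::clock::now());
}

// What the build loop would call on a timer: returns true, and remembers the
// new timestamp, when the trigger was touched since the last call.
bool consume_touch(const fs::path& trigger, fs::file_time_type& last_seen) {
  std::error_code ec;
  const auto written = fs::last_write_time(trigger, ec);
  if (ec || written == last_seen) return false;
  last_seen = written;
  return true;
}
```

The watching side only ever checks one path, so its cost does not depend on the size of the source tree.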


Out of curiosity, what editor do you use and how do you make sure 4. happens when for example using ‘save all’?


I'm currently using neovim so it is pretty trivial to add save hooks. Although the approach I am currently using is just a custom shortcut that saves all files and touches the file. This way only explicit saves by me trigger the rerun.

My current setup is documented here but it's easy to tweak to your preferred workflow. https://kevincox.ca/2022/06/14/small-tools/#w






