I have personally written a similar tool and I am very curious about how this co...

e-dant · on Oct 18, 2022

More technically, here’s what we have:

A “baseline” filesystem watcher which uses only the standard library. It has been made to beat kqueue. And it does.

A platform filesystem watcher for Darwin is used, but certain event properties are handled by the standard library. Namely, the event time and the path type.

A platform filesystem watcher is schedule for Windows. Work hasn’t been started.

A platform filesystem watcher for Linux (> 2.4 or so) was toyed with but ultimately rejected out of accuracy concerns. It was far more efficient than the cross-platform implementation “warthog”, no doubt, but it lacked accuracy. Work is being done to get most of the benefits from both worlds.

There are problems with the “baseline” watcher (which I’ve named “warthog” because it’s sturdy and reliable). But those are potential efficiency losses when watcher more than a few million paths. They are, thankfully, not accuracy or safety problems.

Maybe you can see the solution emerging here?

Here’s where we’re going next:

The most efficient kernel watchers can be used on most platforms, but checked for their accuracy periodically by the “warthog” watcher.

diffxx · on Oct 18, 2022

What do you mean by beat kqueue? Is it faster than kqueue? Does it use less memory than kqueue?

How does the baseline filesystem watcher work? If it doesn't use kqueue, does it poll the filesystem periodically and diff against an in memory representation? If yes, see my other comments. If not, I am genuinely curious what you are doing because you know something that I do not.

e-dant · on Oct 18, 2022

When I began this project, I started with kqueue. The performance was wanting and there were bugs with very large file trees.

I moved to a minimal std::filesystem-based watcher and optimized it from there.

There hasn’t been a formal head-to-head test between the two. That should be about halfway down my todo list. It’s worth revisiting more formally.

My response to this question should help here: https://news.ycombinator.com/item?id=33247155#33251437

In short, there’s no secret sauce. There’s an efficiency spread in (what I consider) edge-cases.

Every potential gain over other naive watchers implemented with kqueue is likely algorithmic. I store events in a historical map, compare differences to the current state of the file tree, prune them, and send events when they change. That’s the whole implementation: scan paths, record their attributes, check for differences in the map, and send events when they happen. I haven’t given much thought to exactly why it beats kqueue, nor are there any good tests showing by how much. (Again, this is worth doing.)

diffxx · on Oct 18, 2022

Makes sense. I have only used kqueue on macos to monitor a small number of files and I find it quite painful to use and the semantics were confusing, not sure if it is different on say freebsd.

Just as a heads up, one of the strange fsevents issues is that it fails if you register two directories where one directory is a prefix of the other. So say that you want to monitor directories $ROOT/foo and $ROOT/fo and you register an event stream first with $ROOT/foo and then $ROOT/fo, you will only receive events for paths in $ROOT/fo and no events for paths in $ROOT/foo (I just double checked that this is still the case in Monterey at least). I never bothered to report this to apple but worked around it by just registering a stream with $ROOT if I detected that one path name was a substring of another.

marwis · on Oct 19, 2022

Have you tried using auditpipe?

commandlinefan · on Oct 18, 2022

> hook into the underlying kernel events like kqueue...

I'm really surprised that this sort of functionality isn't built into OS's/filesystems. I recently had to do this for HDFS, and I finally "gave up" and polled the file system like you suggest as your first option. Event notification seems like something that ought to be a fundamental feature and is best owned by the file system itself.

nomel · on Oct 18, 2022

> I'm really surprised that this sort of functionality isn't built into OS's/filesystems

It appears to be built into macOS [1]?

> Whenever the filesystem is changed, the kernel passes notifications via the special device file /dev/fsevents to a userspace process called fseventsd

Which I assume is what they're referring to here:

> A platform filesystem watcher for Darwin is used, but certain event properties are handled by the standard library. Namely, the event time and the path type.

1. https://en.wikipedia.org/wiki/FSEvents

diffxx · on Oct 18, 2022

> It appears to be built into macOS [1]?

It is, but it's badly implemented and buggy. But the real problem is that there is no posix like specification for file system events so every platform does it differently. Even if every platform implementation were perfect and bug free, it is a huge pain to write wrappers for each one.

diffxx · on Oct 18, 2022

Completely agree. That is why having built a tool similar to this one, I'm not even linking to it. The complexity involved in working around the OS limitations is maddening and convinced me that it would be better to think of a different approach to writing software that wouldn't require monitoring files to achieve the fast feedback loop that these tools are designed to facilitate.

The magic file approach described by kevincox below is probably the best way to get > 95% of the benefit with < 1% of the work.

e-dant · on Oct 18, 2022

It’s difficult to get it perfectly right.

There is ongoing work attempting to make it more perfect.

I expect a year or two before this is complete.

For now though, it does do what it says. The tests I’ve run show that it is accurate over large amounts of events and time. For under 1 million files and/or directories, it uses a near-zero amount of resources. Testing on older processors shows similarly positive results.

But this is so far from perfect. This is only the groundwork. Most of the bugs have yet to be discovered. The platform support, more often than not, uses the safe “baseline” watcher in favor of accuracy.

Ned14 of Boost fame has given the project some expert advice which will help it along smoothly.

diffxx · on Oct 18, 2022

What do you mean near-zero? You said that inotify doesn't work (and ned14 offers his comments about it). If you are using polling, I do not understand how your approach could be using non-zero amount of resources. Let's say you are monitoring a directory with 1 million files, how can you store the state in less than 20MB of memory (which is about the most optimistic lower bound that I can think of)? What is your secret sauce? Do you mean there is no overhead beyond the baseline watcher? But what about the overhead of the baseline watcher itself?

For what it's worth, in spite of ned14's comments, I have never seen inotify fail in practice (except for if it hits the os file descriptor limits in which case it does fail noisily). The tool I wrote uses inotify for linux. It is used by thousands of developers every day as part of an editor integration and there are no open issues about dropped file events.

Your time frame is probably about right. It took me about a year to work through all the edge cases.

e-dant · on Oct 18, 2022

Near-zero is a bit loose. It keeps a relatively compact in-memory representation. You’re about right with your estimate. Having measured just now, it’s about 30mb for 1 million directories.

The baseline Watcher’s efficiency has a wide spread. When there are many thousands of nested subdirectories, the CPU approaches the limit of the thread it’s on. Flatter directories, or many files without nested subdirectories, do not have nearly as much of an effect. I’ve seen it run on around 10 million paths on a very flat test directory.

So, near-zero is somewhat misleading. There’s a wide spread in efficiency. It was my judgement that deeply nested directory trees were far less common in practice then, so I wrote “near-zero” in the optimistic case.

It uses polling under the hood (at least, I’m sure it does. It uses whatever std::filesystem uses, which is almost certainly polling).