I thought it was very cool how gVisor is multi-backend (their “sentry” is implemented via either ptrace or KVM), which is pretty unusual among instrumentation tools.
We could maybe have shared this logic to intercept syscalls and redirect them to user space code serving as the kernel. That is, we could have shared the Reverie layer. We saw ourselves as headed towards an in-guest binary instrumentation model (like rr’s syscall buffer). And so one factor is that Rust is a better fit than Go for injecting code into guest processes.
Regarding the actual gVisor user space kernel... we could have started with that and forked it to add determinism features to that kernel. At first glance that would seem to save on implementation work, but “implement futexes deterministically” is a pretty different requirement than “implement futexes”, so it’s not clear how much savings could have been had.
We could still have a go at reusing their KVM setup to implement a Reverie backend. But there’s some impedance matching to do across the FFI there, with the Reverie API relying on Rust’s async concurrency model and Tokio. Hopefully we could cleanly manage separate thread pools for the Go threads taking syscalls vs. the Tokio thread pool hosting Reverie handler tasks. Or maybe it would be possible to reuse their solution without delivering each syscall to Go code.
The distinction that we often have trouble getting across is between eliminating/controlling nondeterminism, vs recording what happens to occur.
Undo, like rr and Microsoft TTD, records basically all syscalls, but doesn’t determinize anything in the original execution, only in the replay. A “hermit run” call is like a 0-byte recording -- no nondeterminism, so you can “replay” by just running again.
On the overhead, the largest factor is the means of program instrumentation. I don’t know where rr sits, but I’ve heard Microsoft’s solution is quite performant on Windows.
There’s a bit of shared lineage there. Joseph Devietti and I started working in this area together around 2015. And Joe had worked on the deterministic dOS at UW during his PhD (which led to Jinx).
Another related effort is antithesis.com, which also seems to use a hypervisor approach rather than Hermit’s Linux-syscall-API-level approach.
Well, we don't prove the absence of concurrency bugs -- that would be more a job for formal verification or type systems at the source level.
But we can tell when our `--chaos` stress tests cease to produce crashes in reasonable numbers of runs. And when we do achieve a crash we can use our analysis phase to identify the racing operations.
It's both a pro and a con of the approach that we work with real crashes/failures. This means it's a less sensitive instrument than tools like TSan (which can detect data races that never cause a failure in an actual run), but conversely we don't have to worry about false positives, because we can present evidence that a particular order of events definitely causes a failure. Also, we catch a much more general category of concurrency bugs (ordering problems between arbitrary instructions/syscalls, even between processes and in multiple languages).
That's still roughly accurate because Hermit is, today, still ptrace-powered. I'll quote my reply from elsewhere about the WIP high-perf backend:
> The `experimental/reverie-sabre` directory in the Reverie repo contains our high performance backend, but it's still work-in-progress. It uses binary instrumentation and in our early experiments is 10X faster than our current backend in the worst case (i.e. strace is >10X faster when rewritten with reverie-sabre and run on a program that does nothing but syscalls).
Indeed, releasing a faster drop-in "riptrace" strace replacement is one of the goals ;-).
> How are you sanitizing the result of stat(), for example?
Ah, that part's the same as described in the ASPLOS'20 paper (https://dl.acm.org/doi/10.1145/3373376.3378519). Briefly, we present a somewhat sanitized view of the file system. For example, if you do `hermit run -- ls -l`, you'll see that you are root, everything owned by you is owned by root, and everything else is currently owned by nfsnobody. The file mod/access times are all set to whatever you provide for `--epoch`, because we think you mainly care about the file system contents, not the mod times. (Mod times do CHANGE as execution progresses though; otherwise programs like `make` become confused.)
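If it helps to picture it, here's roughly the kind of rewriting that implies, as a minimal Rust sketch -- the constants and exact set of fields are illustrative, not Hermit's actual code:

    use libc::stat;

    // Illustrative constants -- not Hermit's actual defaults.
    const EPOCH_SECS: i64 = 1_640_995_200; // e.g. 2022-01-01T00:00:00Z
    const NFSNOBODY: u32 = 65534;

    /// Rewrite a stat result so the guest sees a sanitized, reproducible view:
    /// the container user looks like root, everyone else like nfsnobody, and
    /// timestamps are pinned to the configured epoch.
    fn sanitize_stat(st: &mut stat, owned_by_container_user: bool) {
        if owned_by_container_user {
            st.st_uid = 0;
            st.st_gid = 0;
        } else {
            st.st_uid = NFSNOBODY;
            st.st_gid = NFSNOBODY;
        }
        // Pin timestamps to the epoch (the real thing advances them as the
        // deterministic execution progresses, which this sketch omits).
        st.st_atime = EPOCH_SECS;
        st.st_mtime = EPOCH_SECS;
        st.st_ctime = EPOCH_SECS;
        st.st_atime_nsec = 0;
        st.st_mtime_nsec = 0;
        st.st_ctime_nsec = 0;
        // Report boring constants for fields that vary between runs.
        st.st_blksize = 4096;
        st.st_blocks = (st.st_size + 511) / 512;
    }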
> fixed file system images are not sufficient to guarantee deterministic behavior by filesystems.
Indeed! Hence we present only the sanitized view of the filesystem, so that we can treat each filesystem as its abstract contents (files as bitstrings, and directories as sorted lists of files). For inodes, we translate to and from virtual inodes, rather than reveal the real inodes of the file system.
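As a rough sketch of what that inode translation looks like (not Hermit's actual data structure, just the idea): virtual inodes are assigned in the order the guest first observes them, so the numbers a program sees depend only on its own deterministic behavior:

    use std::collections::HashMap;

    /// Deterministic inode virtualization, sketched: real inodes map to
    /// virtual ones in order of first appearance.
    #[derive(Default)]
    struct InodeTable {
        real_to_virt: HashMap<u64, u64>,
        virt_to_real: HashMap<u64, u64>,
        next_virt: u64,
    }

    impl InodeTable {
        /// Translate a real inode to the virtual inode shown to the guest.
        fn virtualize(&mut self, real: u64) -> u64 {
            if let Some(&v) = self.real_to_virt.get(&real) {
                return v;
            }
            self.next_virt += 1;
            let v = self.next_virt;
            self.real_to_virt.insert(real, v);
            self.virt_to_real.insert(v, real);
            v
        }

        /// Translate back when the guest hands a virtual inode to the kernel.
        fn devirtualize(&self, virt: u64) -> Option<u64> {
            self.virt_to_real.get(&virt).copied()
        }
    }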
If there's a flaw in this abstraction of the file system... we'd love to find it. Mainly we've been using it with zfs and Meta's "edenfs" virtual file system so far.
> `st_nblocks` and `st_blksize` fields in `struct stat` can change in-between stat() calls when nothing else happens
Yes, now you feel our pain. We've been plugging these kinds of things, and there are surely some we missed. We report boring constant values where we can get away with it (for st_blksize). Irreproducible "counter example" programs are a welcome form of contribution ;-)!
> As another example, there are filesystems that generate a random seed
Ah, that's an interesting one. Since we determinize (only) userspace, any random bytes that get into guest memory are deterministic (getrandom, /dev/random, etc.), but we won't see internal nondeterminism in the filesystem; instead we rely on sanitizing the way the filesystem appears to userspace. But if it just affects order, we should be OK, because we sort before returning from the getdents syscall.
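To make "deterministic random bytes" concrete, here's a minimal sketch of the idea: an intercepted getrandom is served from a PRNG seeded by the container configuration, so whatever lands in guest memory is a pure function of that seed. The PRNG and the seeding below are stand-ins, not Hermit's actual scheme:

    /// Deterministic stand-in for the host's entropy source.
    struct DetRng {
        state: u64,
    }

    impl DetRng {
        fn new(seed: u64) -> Self {
            DetRng { state: seed }
        }

        // splitmix64 step: simple and deterministic, good enough for a sketch.
        fn next_u64(&mut self) -> u64 {
            self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = self.state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^ (z >> 31)
        }

        /// Fill the guest's getrandom() buffer with deterministic bytes.
        fn fill(&mut self, buf: &mut [u8]) {
            for chunk in buf.chunks_mut(8) {
                let bytes = self.next_u64().to_le_bytes();
                chunk.copy_from_slice(&bytes[..chunk.len()]);
            }
        }
    }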
> reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry.
Yes, unfortunately. That's not great for performance, and we need to implement a caching scheme to at least amortize the overhead of this sort for pathological cases. Still, we will need to trigger the sort even on a single read as you say. So there's a pathological case there -- listing one item from a ginormous directory -- that would run much slower than the native/nondeterministic version.
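For illustration, the behavior described above is roughly this (using std::fs for readability; the real thing operates on raw getdents buffers, and the inodes would additionally be virtualized as described earlier):

    use std::fs;
    use std::io;
    use std::os::unix::fs::MetadataExt;
    use std::path::Path;

    /// Read the whole directory, then sort by name, so that enumeration order
    /// can't leak filesystem internals. Any getdents is then served from this
    /// snapshot -- including a request for a single entry, which is the
    /// pathological case mentioned above.
    fn sorted_dir_snapshot(dir: &Path) -> io::Result<Vec<(String, u64)>> {
        let mut entries = Vec::new();
        for entry in fs::read_dir(dir)? {
            let entry = entry?;
            let name = entry.file_name().to_string_lossy().into_owned();
            let real_inode = entry.metadata()?.ino(); // would be virtualized
            entries.push((name, real_inode));
        }
        entries.sort_by(|a, b| a.0.cmp(&b.0));
        Ok(entries)
    }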
> telldir() / seekdir()
Well, these (and readdir) are not syscalls; they're libc functions layered on top of getdents. So if we've handled getdents correctly we should be OK here. But this is a good suggestion for tests we need to add that stress these!
> in-memory view can become out-of-date in-between readdir() / telldir() / seekdir() calls by one process
Yes, we have a "non-interference" assumption: either you're running with a container-private file system (e.g. inside Docker), or none of your directories are concurrently messed with by other processes outside the container.
> assign deterministic inode numbers for all directory entries, which also seems non-trivial.
Yep, that's what we do!
> Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this?
We require a certain level of good behavior by the program. To be practical, we aim to work for realistic programs but not necessarily adversarial ones. One way you can break our sandboxing, for example, is to run CPUID, learn that the processor does not support instruction X, and then execute X anyway. For instance, we can trap RDTSC in userspace, but not RDRAND.
If someone wants to use Hermit for, say, reverse engineering malware, then we need a Reverie backend that is hardened by emulating/rewriting the instruction stream carefully to protect against unsupported instructions or undefined conditions.
> Last question: how do you sanitize the RDTSC CPU instruction?
That one we can trap in userspace and then we return deterministic virtual time. Currently that time is a linear combination of the branches and system calls executed by all threads up to the current moment in time. For example, if you do RDTSC in a loop, you will see time advancing.
But using rdtsc to recreate fine-grained timers and implement side-channel attacks is impossible under Hermit. (Side channel attacks in general are only possible if first breaking the sandboxing in some way, and we don't know of a specific attack that will do the trick yet.)
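To make the virtual-time idea above concrete, here's a minimal sketch of a clock that is a linear combination of the branches and syscalls executed across all threads. The weights are made up for illustration; they are not Hermit's actual coefficients:

    // Illustrative weights, not Hermit's actual coefficients.
    const NANOS_PER_BRANCH: u64 = 1;
    const NANOS_PER_SYSCALL: u64 = 1_000;

    /// Deterministic virtual clock: advances only as a function of what the
    /// guest itself does, never the host's wall clock.
    #[derive(Default)]
    struct VirtualClock {
        branches_retired: u64,
        syscalls_executed: u64,
    }

    impl VirtualClock {
        fn observe_branches(&mut self, n: u64) {
            self.branches_retired += n;
        }

        fn observe_syscall(&mut self) {
            self.syscalls_executed += 1;
        }

        /// What an intercepted rdtsc / clock_gettime would report
        /// (plus the configured epoch offset).
        fn now_nanos(&self) -> u64 {
            self.branches_retired * NANOS_PER_BRANCH
                + self.syscalls_executed * NANOS_PER_SYSCALL
        }
    }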
That makes a lot of sense, thanks for all the answers!
It's definitely a very interesting project, and I think that once more programs can work without changes, it could perhaps be used for building packages by distros that have shown a special interest in (and can benefit significantly from) reproducible builds -- like Debian for instance, but especially NixOS, given the existing work on content-addressed paths (explained here: https://www.tweag.io/blog/2020-09-10-nix-cas/).
The serialization of threads (and I assume processes?) would probably be an issue for large packages, but at least it would be possible to build many packages in parallel once the base dependencies have been built.
I had thought about starting such a project myself (for reproducible builds), but of course, the amount of work is very significant and for me it was not worth the benefits :)
> we need to implement a caching scheme to at least amortize the overhead of this sort for pathological cases. Still, we will need to trigger the sort even on a single read as you say. So there's a pathological case there -- listing one item from a ginormous directory -- that would run much slower than the native/nondeterministic version.
Actually, I just thought that if you decide to go down this path because of this specific pathological issue, the cache that you're mentioning could simply be a file that Hermit maintains to represent a sorted view of a directory (which could be deleted after Hermit finishes running if you so desire, although you could keep it in-between runs with some care!).
So you'd only have to pay the cost of reading the entire directory once per directory that is read -- specifically, on the first time that some program reads it.
You could even pre-generate such cached views for all directories in the container once Hermit starts, if you'd prefer to pay this cost at start-up (although I don't see why).
All subsequent operations relating to a directory could use the corresponding file as a cached and sorted view of it (and such operations that modify the directory would have to keep the file in sync, of course).
So even if a program reads a directory, then closes it, and some other program opens it and reads it, this second read would use the cache file instead, so it would not have to read the whole directory at once.
This would be even better for directories that are created by programs running inside Hermit, since the cache file would be created along with the directory when it is initially created, so there would be no need to read the entire directory at once at any time.
It would also mean that Hermit wouldn't have to potentially use large amounts of memory just because of this cache, or "forget" entries from the cache while it is running (to limit memory usage), since it would be stored persistently on-disk (and would still benefit from in-memory caching from the underlying filesystem / kernel).
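Just to sketch the idea (all the names here are made up, of course), something like:

    use std::fs;
    use std::io::{self, Write};
    use std::path::{Path, PathBuf};

    const CACHE_NAME: &str = ".hermit-sorted-view"; // hypothetical sidecar file

    fn cache_path(dir: &Path) -> PathBuf {
        dir.join(CACHE_NAME)
    }

    /// Serve the sorted view of a directory from its sidecar file, building
    /// the sidecar on the first read -- the only time the whole directory is
    /// scanned. Operations that modify the directory would have to keep the
    /// sidecar in sync, which this sketch omits.
    fn sorted_view(dir: &Path) -> io::Result<Vec<String>> {
        let cache = cache_path(dir);
        if let Ok(contents) = fs::read_to_string(&cache) {
            return Ok(contents.lines().map(str::to_owned).collect());
        }
        // First read: pay the full scan + sort once, then persist it.
        let mut names: Vec<String> = fs::read_dir(dir)?
            .filter_map(|e| e.ok())
            .map(|e| e.file_name().to_string_lossy().into_owned())
            .filter(|n| n.as_str() != CACHE_NAME) // hide the sidecar itself
            .collect();
        names.sort();
        let mut f = fs::File::create(&cache)?;
        for name in &names {
            writeln!(f, "{name}")?;
        }
        Ok(names)
    }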
Yeah, it's interesting to think about persisting the state we would need to make the file system more sympatico with Hermit. If we were willing to have a daemon.... Meta develops this "watchman" tool that our build infrastructure uses. I think for existing file systems we could imagine a daemon that watches the directory and caches what we need.
But if we could dream of new file systems, then I want one that is basically btrfs but with O(1) file-level parallel hashes (rather than block-level). Heck, maybe it could even enforce a sorted order on directories for us too. The hashes would be very useful in record-and-replay scenarios, where you could know immediately whether the input files are in your content-addressable blob storage without having to hash them every time.
P.S. about Reproducible Builds -- we pitched this to the community in 2019 (at the Reproducible Builds Summit) and with that ASPLOS'20 paper, but we would be eager to see anyone wanting to pick it up and take it further.
The underlying Reverie instrumentation layer works on ARM, but Hermit isn't ported yet, and we haven't touched RISC-V yet at all. (Contributions welcome!)
One thing we haven't tried yet is just putting a whole emulator (qemu etc) underneath Hermit. That would address any sources of irreproducibility that the emulator lets through from the host (threads, RNG, etc).
Well, gettimeofday is a syscall, and we do intercept it (along with clock_gettime, clock_getres, time, nanosleep, and the rdtsc instruction, even though that last one is not a syscall). When we intercept it, we report virtual time back to the guest. We make sure that virtual time is deterministic across all threads in the container, irrespective of what the wall-clock time is on the host machine.
So for instance, if there are multiple threads in a chess engine, and they are racing to write search results to a data structure, these threads will interleave in a reproducible order under Hermit, and the races will resolve consistently.
But the downside is that Hermit does sequentialize execution onto one core. So in the current version, a multithreaded program doesn't get actual wall-clock speedup from its parallelism. (The earlier dettrace did allow some limited guest parallelism, and we plan to bring that back.) For this reason, Hermit is good for consistently testing multithreaded software, but you wouldn't want to run parallel software under it outside of testing.
Yes, I'm very interested in that as well. I've been involved with the ACM Artifact Evaluation process which has been going on in several conferences for a while.
But it's been pretty frustrating. As an author, my PLDI 2014 artifact stopped working less than 5 years later (Docker image binary incompatibility). And when I was co-chair of an Artifact Evaluation Committee in 2017, the reproducibility of the submitted artifacts was not great either.
If you package a VM (freezing the Linux kernel), and are pretty sure that VM will run in 10 years, PLUS you determinize the execution itself... that should allow durable bitwise reproducibility. Maybe Hermit could be one ingredient of that.
For scientific reproducibility, there is a lot of other tooling to build too, and I know some folks have been working in that area:
Alas the performance overhead in realtime is not great yet. It still uses ptrace currently, which often results in a multiple-X slowdown (but at least it doesn't "subscribe" to every syscall like strace does, because some are naturally deterministic). Reverie's whole design is to make it support swappable backends, and this ptrace backend is just the reference implementation. The `experimental/reverie-sabre` directory in the Reverie repo contains our high performance backend, but it's still work-in-progress. It uses binary instrumentation and in our early experiments is 10X faster than our current backend in the worst case (i.e. strace is >10X faster when rewritten with reverie-sabre and run on a program that does nothing but syscalls).
But to the second part of your question about deterministic benchmarking, that is really a separate question. Hermit defines a deterministic notion of virtual time, which is based on the branches retired and system calls executed by all threads. When you run hermit with `--summary`, it reports a total "Elapsed virtual global time", which is completely deterministic:
$ hermit run --summary /bin/date
...
Elapsed virtual global (cpu) time: 5_039_700ns
Therefore, any program that runs under hermit can get this deterministic notion of performance. We figured that could be useful for setting up performance regression tests with very small regression margins (<1%), which you can't do on normal, noisy systems. Compilers are one place I've worked where we wanted smaller performance regression alarms (for generated code) than we could achieve in practice. We haven't actually explored this application yet though. There's a whole small field of people studying performance modeling and prediction, and anyone wanting to try this deterministic benchmarking approach might want to take some of that knowledge and build a more accurate performance model (better correlated with wall time) than Hermit's current virtual time.
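For example, such a regression test could look roughly like the sketch below. It's illustrative only, not a supported workflow: it assumes the summary line format shown above, guesses that the summary may land on stdout or stderr, and uses a made-up baseline value:

    use std::process::Command;

    /// Pull the deterministic virtual time out of `hermit run --summary` output.
    fn parse_virtual_nanos(summary: &str) -> Option<u64> {
        let line = summary
            .lines()
            .find(|l| l.contains("Elapsed virtual global"))?;
        let digits: String = line
            .rsplit(':')
            .next()?
            .chars()
            .filter(|c| c.is_ascii_digit())
            .collect();
        digits.parse().ok()
    }

    fn main() {
        let out = Command::new("hermit")
            .args(["run", "--summary", "/bin/date"])
            .output()
            .expect("failed to run hermit");
        // Check both streams, since this sketch doesn't assume which one
        // carries the summary.
        let text = format!(
            "{}{}",
            String::from_utf8_lossy(&out.stdout),
            String::from_utf8_lossy(&out.stderr)
        );
        let measured = parse_virtual_nanos(&text).expect("no summary line found");
        let baseline: u64 = 5_039_700; // previously recorded virtual time (made up)
        let margin = 0.01; // a <1% margin is feasible because the number is deterministic
        assert!(
            (measured as f64) <= (baseline as f64) * (1.0 + margin),
            "virtual-time regression: {measured}ns vs baseline {baseline}ns"
        );
    }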
Well, the starting datetime at the beginning of execution in the container is whatever you set it to:
$ hermit run --epoch=2022-01-01T00:00:00Z /bin/date
Fri Dec 31 16:00:00 PST 2021
We, somewhat eccentrically, put it in the last millennium by default. It used to default to the original Unix epoch of 12/31/1969, but that was causing some software to be very unhappy ;-).
The reproducibility guarantee is that the behavior of the program is a deterministic function of its initial configuration. The epoch setting is one aspect of that initial configuration (as are file system inputs, RNG seeds, etc).
> We, somewhat eccentrically, put it in last millennium by default.
Hah, that is still going to make some TLS software and their certificate tests very unhappy! Does it show that I've run into a similar issue before? ;)
But of course, it's trivial to fix with the --epoch parameter :)