Huh. So my initial response was, "why on earth would you need a whole OS for that", but memory snapshotting and improved virtual memory performance might actually be a good justification. Linux does have CRIU which might be made to work for such a purpose, but I could see a reasonable person preferring to do it from a clean slate. On the other hand, if you need qemu to run applications (which I'm really unclear about; I can't tell if the plan is to run stuff natively on this OS or just to provide enough system to run qemu and then run apps on linux on qemu) then I'm surprised that it's not easier to just make qemu do what you want (again, I'm pretty sure qemu already has its own memory snapshotting features to build on).
Of course, writing an OS can be its own reward, too:)
Oooh, wasn't really expecting this to make it to HN cause it was meant to be more of an announcement than a description.
But yes, I've done about 7 or 8 operating systems for fuzzing in the past and it's a massive performance (and cleanliness) improvement. This one is going to be like an operating system I wrote 2-3 years ago for my vectorized emulation work.
To answer your QEMU questions: the goal is to effectively build QEMU with MUSL (just to make it static so I don't need a dynamic loader), and modify MUSL to turn all syscalls into `call` instructions. This means a "syscall" is just a call to another area, which will be my Rust Linux emulator. I'll implement the bare minimum syscalls (and enum variants to those syscalls) to get QEMU to work, nothing more. The goal is not to run Linux applications, but to run a QEMU+MUSL combination which may be modified lightly if it means a lower emulation burden (eg. getting rid of threading in QEMU [if possible] so we can avoid fork()).
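Roughly, the shape of the thing looks like this (an illustrative sketch only, not the actual code; the function name, signature, and the pair of stubbed syscalls are made up for the example):

```rust
// Hypothetical sketch of the emulator-side entry point that a patched MUSL
// syscall stub would `call` directly instead of executing a `syscall`
// instruction. Syscall numbers follow the x86_64 Linux ABI, but everything
// here is illustrative, not the real dispatch.

/// Linux x86_64 syscall numbers we might minimally support.
const SYS_WRITE: u64 = 1;
const SYS_EXIT_GROUP: u64 = 231;

/// Called directly by the patched MUSL stub: the syscall number and arguments
/// arrive as ordinary function arguments rather than in kernel-visible
/// registers, so the whole "syscall" costs a near call and a return.
#[no_mangle]
pub extern "C" fn emulated_syscall(
    nr: u64,
    a0: u64,
    a1: u64,
    a2: u64,
    _a3: u64,
    _a4: u64,
    _a5: u64,
) -> i64 {
    match nr {
        SYS_WRITE => {
            // Pretend the write succeeded and consumed every byte.
            let _fd = a0;
            let _buf = a1 as *const u8;
            a2 as i64
        }
        SYS_EXIT_GROUP => {
            // In a real emulator this would hand control back to the fuzzer.
            panic!("guest called exit_group({})", a0)
        }
        // Anything unimplemented returns -ENOSYS so it fails loudly.
        _ => -38,
    }
}
```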
The main point of this isn't performance, it's determinism, but performance is a nice side effect. A normal syscall instruction involves a context switch to the kernel, potentially cr3 swaps depending on CPU mitigation configuration, and the same to return back. This can easily be hundreds of cycles. A `call` instruction to something that handles the syscall is on the order of 1-4 cycles.
While for syscalls this isn't a huge deal, it's even more pronounced when it comes to KVM hypercalls. Transitions to a hypervisor are very expensive, and in this case the kernel, the hypervisor, and QEMU (eg. device emulation) will all be running at the same privilege level, and there won't be a weird QEMU -> OS -> KVM -> other guest OS device -> KVM -> OS -> QEMU transition on every device interaction.
But then again, it's mainly for determinism. By emulating Linux deterministically (eg. not providing entropy through timestamps or other syscall returns), we can ensure that QEMU has no source of external entropy, and thus will always do the same thing. Even if it uses a random-seeded hash table, the seed would be derived from syscalls, and thus will be the same every time. This determinism means the guest will always do the same thing, down to the instruction. Interrupts happen on the same instructions, context switches do, etc. This means any bug, regardless of how complex, will reproduce every time.
All of this syscall emulation + determinism I have also done before, in a tool called tkofuzz that I wrote for Microsoft. That used Linux emulation + Bochs, and it was written in userspace. It has proven incredibly successful and it's what most researchers at Microsoft are using now. That being said, Bochs is about 100x slower than native execution, and now that people have gotten a good hold of snapshot fuzzing (there's a steep learning curve), it's time for a more performant implementation. With QEMU we get this with a JIT, which at least gets us a 2-5x improvement over Bochs while still "emulating", but even more value could be found if we get the KVM emulation working and can use a hypervisor. That being said, I do plan to support a "mode" where guests which do not touch devices (or more specifically, snapshots which are taken after device I/O has occurred) will be able to run without QEMU at all. We're really only using QEMU for device emulation + interrupt control; thus, if you take a snapshot to a function that just parses everything in one thread, without process IPC or device access (which is rarer than you'd think; when you "read" from a disk, you're likely just hitting OS RAM caches, and thus not touching devices), we can cut out all the "bloat" of QEMU and run in a very, very thin hypervisor instead.
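As a toy illustration of what "no external entropy" means (again just a sketch, not the real handlers): time and randomness can come from counters and fixed-seed generators inside the emulator, so every run of the guest observes exactly the same values.

```rust
/// Deterministic "time": a virtual nanosecond counter that only advances
/// when the guest asks, so identical runs see identical timestamps.
struct DeterministicClock {
    virtual_nanos: u64,
}

impl DeterministicClock {
    fn new() -> Self {
        Self { virtual_nanos: 0 }
    }

    /// What a clock_gettime() emulation might hand back to the guest.
    fn now(&mut self) -> u64 {
        // Advance by a fixed amount per query instead of reading the TSC.
        self.virtual_nanos += 1_000;
        self.virtual_nanos
    }
}

/// Deterministic "randomness": a fixed-seed xorshift stream, so a guest that
/// seeds a hash table from getrandom() gets the same seed every run.
struct DeterministicRng {
    state: u64,
}

impl DeterministicRng {
    fn new() -> Self {
        Self { state: 0x1234_5678_9abc_def0 }
    }

    fn next(&mut self) -> u64 {
        self.state ^= self.state << 13;
        self.state ^= self.state >> 7;
        self.state ^= self.state << 17;
        self.state
    }
}

fn main() {
    let mut clock = DeterministicClock::new();
    let mut rng = DeterministicRng::new();
    // Every run of this program prints exactly the same values.
    println!("time = {}, rand = {:#x}", clock.now(), rng.next());
}
```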
In fuzzing it's critical to have ways to quickly map and unmap memory, as most fuzz cases last for only hundreds of microseconds. This means that after a few hundred microseconds, I want to restore all memory back to the state "before I handled user input" and continue again. This is extremely slow in every conventional operating system, and there's really no way around it. It's of course possible to write a driver or use CRIU, but neither is exactly the solution that is needed here. I'd rather just make an OS that trivially runs in KVM/Hyper-V/Xen, and thus can run in a VM to get the cross-platform support, rather than writing a driver for every OS I plan to use this on.
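To give a flavor of why a purpose-built reset path is so much cheaper, here's a toy differential-restore sketch (illustrative only; a real hypervisor would get dirty-page information from EPT dirty bits or write-protection faults rather than logging writes by hand):

```rust
// Keep a pristine copy of guest memory, remember which pages were written
// during the fuzz case, and copy back only those pages on reset.

const PAGE_SIZE: usize = 4096;

struct GuestMemory {
    snapshot: Vec<u8>,       // pristine copy taken "before user input was handled"
    current: Vec<u8>,        // memory the fuzz case actually mutates
    dirty_pages: Vec<usize>, // page indices written since the last reset
}

impl GuestMemory {
    fn new(size: usize) -> Self {
        Self {
            snapshot: vec![0; size],
            current: vec![0; size],
            dirty_pages: Vec::new(),
        }
    }

    /// Write into guest memory, logging the dirtied pages.
    fn write(&mut self, addr: usize, data: &[u8]) {
        self.current[addr..addr + data.len()].copy_from_slice(data);
        for page in addr / PAGE_SIZE..=(addr + data.len() - 1) / PAGE_SIZE {
            if !self.dirty_pages.contains(&page) {
                self.dirty_pages.push(page);
            }
        }
    }

    /// Reset cost scales with how much the fuzz case touched, not with the
    /// total size of the guest, which is what makes sub-millisecond fuzz
    /// cases viable.
    fn reset(&mut self) {
        for &page in &self.dirty_pages {
            let start = page * PAGE_SIZE;
            let end = start + PAGE_SIZE;
            self.current[start..end].copy_from_slice(&self.snapshot[start..end]);
        }
        self.dirty_pages.clear();
    }
}

fn main() {
    let mut mem = GuestMemory::new(16 * 1024 * 1024);
    mem.write(0x1000, b"fuzz case input");
    mem.reset(); // only the one dirtied page is copied back
    assert_eq!(&mem.current[0x1000..0x1004], &[0, 0, 0, 0]);
}
```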
Your YouTube looks awesome, and I just followed you on Twitch. Looking forward to a marathon stream. :)
A question about fuzzing and determinism:
It seems like this requirement of determinism is connected to (and reinforces) a brute force approach to discovering problematic inputs.
This precludes strategies like performing a truly stochastic search (where you put complete faith in the randomness) over your input space.
Are people attempting this kind of thing? Are there additional requirements to fuzzing that make probabilistically finding a thin trajectory through the input space undesirable?
Determinism is already fairly important in fuzzing for 2 major reasons.
One is that determinism makes it easier to triage bugs. If I find a crash while fuzzing, but there's no way to reproduce it, it's pretty much never going to get fixed (unless the call stack from the first crash is obvious enough to prepare a fix). For fuzzing systems, this is effectively mandatory. Imagine a memory allocation failure leading to a NULL-deref: the odds of hitting that same bug again are low because it requires going OOM at the same point in a subsequent execution. Snapshot fuzzing (each fuzz case starts from a snapshotted state of memory/registers/whatever) mitigates some of this, but there's still noise from context switches that will affect RNGs, which will affect heap layouts, which will affect everything on the system. That being said, most fuzzing out there publicly is not systems fuzzing, and thus this is typically not something people think about.
But two is what most people use determinism for. In modern fuzzing, coverage guidance is pretty much mandatory. This means saving off an input whenever it hits new code, so that it can be built upon. At a very simple level, this means a problem which is 256^4 turns into 256*4, as all requirements do not need to be satisfied simultaneously; as long as the previous requirements cause new code to get hit, they can be built upon. Of course, if the program does not behave the same way every time, the noise can start to erode the value provided here, since you're not actually building upon the same thing.
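A toy example of that 256^4 vs 256*4 point (my own illustration): a 4-byte magic check compiled as four separate comparisons can be solved one byte at a time, because each newly-matched byte produces new coverage and the input gets kept as a stepping stone.

```rust
// Made-up target, not from the thread. Each byte comparison is its own
// branch, so matching one more byte lights up new coverage and a
// coverage-guided fuzzer keeps that input instead of having to guess all
// four bytes at once.
fn parse_header(input: &[u8]) -> bool {
    if input.len() < 4 {
        return false;
    }
    if input[0] != b'F' { return false; } // new coverage once byte 0 matches
    if input[1] != b'U' { return false; } // ... then byte 1, and so on
    if input[2] != b'Z' { return false; }
    if input[3] != b'Z' { return false; }
    true // reaching here blind is a 1-in-256^4 event; guided, it's ~256*4 tries
}

fn main() {
    assert!(parse_header(b"FUZZ"));
    assert!(!parse_header(b"FUZA"));
}
```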
I think you're mixing two different concepts up here.
Determinism in the context of fuzzing is related to the reproducibility of a program state. It's deterministic in the sense that all the inputs to the system are known and controlled. This allows us to repeat all inputs and reproduce the exact same behaviour as before, e.g. an error state we stumbled upon or an interesting program state we want to continue exploring.
This in no way precludes sampling the input space stochastically. Brute forcing by sampling all possible inputs sequentially is usually untenable and wasteful. However, once you do encounter a new program state, you'll be able to perfectly recreate it forever.
I think you're right, I was mixing two concepts and my question wasn't really about determinism.
Writing a specialized OS suggests to me that someone is very focused on... the best way I can describe it is cutting a fat trajectory through the input space. I am curious if anyone is spending their effort on sparser (but more intelligent) sampling of the input space instead.
Yes, there's a lot of work being done on more intelligent fuzzing. To throw some terms into the mix, there's coverage-guided fuzzing (which is now an old technique), concolic testing (which combines concrete execution with symbolic execution in order to reach new branches in a targeted way) and grammar fuzzers (which generate valid inputs according to a grammar).
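For a concrete (toy) example of the last of those, a tiny grammar fuzzer for arithmetic expressions might look like this; it's just a sketch, not anyone's actual tool:

```rust
// Inputs are generated from a grammar, so every test case is syntactically
// valid and exercises code past the parser's error paths.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        // xorshift, good enough for input generation
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// expr := number | "(" expr op expr ")"
fn gen_expr(rng: &mut Rng, depth: u32) -> String {
    if depth == 0 || rng.next() % 2 == 0 {
        (rng.next() % 100).to_string()
    } else {
        let ops = ["+", "-", "*", "/"];
        let op = ops[(rng.next() % 4) as usize];
        format!("({}{}{})", gen_expr(rng, depth - 1), op, gen_expr(rng, depth - 1))
    }
}

fn main() {
    let mut rng = Rng(0xdead_beef);
    for _ in 0..5 {
        // Each generated string would be fed to the parser under test.
        println!("{}", gen_expr(&mut rng, 3));
    }
}
```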
These are not really mutually exclusive with the type of work gamozolabs is doing because even with hyperintelligent input generation, you still ideally want raw speed.
I’ve been watching this guy stream on Twitch and I can tell you that he is legit.
Also his streams are often insanely long, going 7 to 13 hours. So I only ever watch his streams live for a while and then I catch the remainder on VOD.
He also has a YouTube page with archive of past streams beyond the retention of Twitch.
Honestly the knowledge he shares is so interesting that I selfishly did not want other people to even know about it. But realistically speaking I am not going to have time to make real use of the knowledge myself anytime soon.
He puts out quality content and he deserves all the attention he can get. And even though the competition to find security bugs and earn bounties might become so tough that neither you nor I ever get to find one and claim some money, the products that we all use will be more secure the more people in the world work on finding these bugs and reporting them.
Do you have any thoughts on how to approach the videos? I have a good OS and fuzzing background, but an 8-hour video seems like an ordeal, and harder to extract value from than something written.
I think my best advice would be that you tune in on one of the streams on Twitch when it is live and ask him about it. Then maybe the two of you can figure out what is most relevant to you of his content compared to what you already know?
Testing code via semi-random inputs[1]. The most common fuzzers, AFL-Fuzz[2] and libFuzzer[3], are coverage-guided: they compile the program with special instrumentation to determine code coverage, then call the program repeatedly, changing the inputs via a genetic algorithm to try to maximize the code paths executed. When unexpected behavior is observed (typically the test harness crashing), the fuzzer saves the test's input for future use.
Basically automatic generation of test case inputs. It's non-deterministic, so it won't always find problems, but it can save a lot of manual effort.
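For a concrete picture, a minimal harness for one of these coverage-guided fuzzers might look like the following (a Rust sketch using the libfuzzer-sys crate as set up by cargo-fuzz; `parse_header` is a made-up stand-in for whatever code you actually want to test):

```rust
#![no_main]
use libfuzzer_sys::fuzz_target;

// The fuzzer calls this closure over and over with mutated inputs and keeps
// any input that reaches new coverage.
fuzz_target!(|data: &[u8]| {
    // Panics, crashes, or sanitizer reports in here are what get flagged.
    let _ = parse_header(data);
});

/// Hypothetical function under test.
fn parse_header(data: &[u8]) -> bool {
    data.starts_with(b"FUZZ")
}
```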