Because the OS has no say. A running program issues an assembly instruction to the CPU to read or write this register, and the CPU complies.
For the OS to have a say, the CPU would need to provide a way for the OS to tell it (usually by setting certain values in other registers) that it should not allow access, at least under certain circumstances.
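For concreteness, this is roughly all it takes from userspace; a sketch, assuming a toolchain that accepts the generic s3_5_c15_c10_1 system-register name used in the article:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Read and write the register directly from an ordinary process.
     * No syscall is involved; the CPU simply executes the MRS/MSR
     * instructions, because nothing at EL1 is set up to trap them. */
    static inline uint64_t read_reg(void) {
        uint64_t v;
        __asm__ volatile("mrs %0, s3_5_c15_c10_1" : "=r"(v));
        return v;
    }

    static inline void write_reg(uint64_t v) {
        __asm__ volatile("msr s3_5_c15_c10_1, %0" : : "r"(v));
    }

    int main(void) {
        write_reg(0x2);   /* the register is per-cluster state, visible to other cores */
        printf("register reads back as 0x%" PRIx64 "\n", read_reg());
        return 0;
    }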
The article actually does go into certain situations where the access is more restricted (search for "VHE"), and also into why that does not really apply here.
Yes, you can introduce new code, but the kernel should also watch for that (JIT compilation etc.) and check the resulting code. It's quite involved, and the whole process looks more like a sandbox or emulator, but it's possible.
> originally I thought the register was per-core. If it were, then you could just wipe it on context switches. But since it's per-cluster, sadly, we're kind of screwed, since you can do cross-core communication without going into the kernel.
There is no indication that the M1 has updatable microcode, nor any other features that might allow such mitigation. (If it did, Apple would've fixed it; I did give them a 90 day disclosure warning and they're not lazy about fixing actual fixable bugs.)
There are more specific answers here, but in general the answer to this question is "only partly". The kernel is what initially gives your process a time slice on the CPU, by setting an alarm for the CPU to return control to the kernel at the end of the time slice, and then just jumping into your code. During your time slice, you can do anything you want with the CPU, and in general only interrupts (timer interrupts, hardware interrupts, page faults, etc.) will cause the kernel to get involved again. There are some specific features that CPU designers add to give extra control to the kernel, but those are features of the CPU, and the kernel only has that kind of control where the CPU explicitly provides it.
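You can actually watch this from userspace. The sketch below (AArch64-only) reads the EL0-visible virtual counter in a tight loop with no syscalls at all; the occasional huge gap between two consecutive reads is an interrupt, typically the scheduler's timer, pulling execution back into the kernel for a while:

    #include <stdint.h>
    #include <stdio.h>

    /* cntvct_el0 is the generic-timer virtual counter, readable at EL0. */
    static inline uint64_t vct(void) {
        uint64_t v;
        __asm__ volatile("mrs %0, cntvct_el0" : "=r"(v));
        return v;
    }

    int main(void) {
        uint64_t prev = vct(), max_gap = 0;
        /* No syscalls in here, yet the thread still gets preempted. */
        for (long i = 0; i < 100000000; i++) {
            uint64_t now = vct();
            if (now - prev > max_gap)
                max_gap = now - prev;
            prev = now;
        }
        printf("largest gap between consecutive reads: %llu ticks\n",
               (unsigned long long)max_gap);
        return 0;
    }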
> The kernel is what initially gives your process a time slice on the CPU, by setting an alarm for the CPU to return control to the kernel at the end of the time slice, and then just jumping into your code.
Somewhat critically, it will also drop down to EL0.
Registers aren't resources you access through syscalls; there's no way for the kernel to control them unless you're running under virtualization or the CPU architecture specifically allows access control for the register. (As the site notes, virtualization allows controlling access to this register.)
Can the kernel scan each page it maps as executable and return an error if it finds instructions interacting with the 'bad' register? Assuming the kernel requires executable pages to be read-only (W^X), this may even be doable (but probably very, very slow).
It does require that, but it allows flipping between RX and RW at will (for JITs), and the M1 actually has proprietary features to allow userspace to do this without involving the kernel, so the kernel couldn't re-scan when those flips happen (plus it would kill performance anyway).
Plus, as I said above, this is prone to false positives anyway because the executable section on ARM also includes constant pools.
Ah, yes, I forgot about that. So indeed there is no non-racy hook point for the kernel to do such a check, even if it made sense and the RX/RW switch went through the kernel, which it doesn't.
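To make the scan (and the false-positive problem) concrete, it would boil down to something like this. The constants are what MRS/MSR of s3_5_c15_c10_1 should encode to under the standard ARMv8 system-register-move layout, derived by hand here, so treat them as an assumption:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical check over an executable page: look for MRS/MSR
     * instructions targeting s3_5_c15_c10_1.
     *
     * Encoding assumption (standard ARMv8 "system register move" layout):
     *   MRS x<t>, s3_5_c15_c10_1  ->  0xd53dfa20 | Rt
     *   MSR s3_5_c15_c10_1, x<t>  ->  0xd51dfa20 | Rt
     * Masking out bit 21 (read vs. write) and bits 4:0 (the GPR) covers both. */
    #define SYSREG_MOVE_MASK  0xffdfffe0u
    #define BAD_SYSREG_MATCH  0xd51dfa20u

    static int page_touches_bad_reg(const uint32_t *words, size_t nwords) {
        for (size_t i = 0; i < nwords; i++)
            if ((words[i] & SYSREG_MOVE_MASK) == BAD_SYSREG_MATCH)
                return 1;
        return 0;
    }

    /* The catch: a match only proves the 4-byte pattern is present, not that
     * it is ever executed as an instruction. A constant pool (or any data
     * embedded in the executable section) containing the same 32-bit value
     * trips the check just as well. */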
> Because pthread_jit_write_protect_np changes only the current thread’s permissions, avoid accessing the same memory region from multiple threads. Giving multiple threads access to the same memory region opens up a potential attack vector, in which one thread has write access and another has executable access to the same region.
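For reference, the per-thread toggle those docs describe is used roughly like this; a sketch for Apple Silicon, where emit_code() is a hypothetical stand-in for the actual code generator, error handling is omitted, and hardened-runtime builds additionally need the JIT entitlement:

    #include <libkern/OSCacheControl.h>   /* sys_icache_invalidate */
    #include <pthread.h>                  /* pthread_jit_write_protect_np */
    #include <stddef.h>
    #include <sys/mman.h>

    void emit_code(void *buf, size_t len);   /* hypothetical codegen */

    typedef void (*jit_fn)(void);

    void emit_and_run(size_t len) {
        /* One RWX-capable region requested with MAP_JIT; each thread then
         * sees it as either writable or executable, never both at once. */
        void *region = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_PRIVATE | MAP_ANON | MAP_JIT, -1, 0);

        pthread_jit_write_protect_np(0);   /* this thread: region is now writable   */
        emit_code(region, len);
        pthread_jit_write_protect_np(1);   /* this thread: region is now executable */
        sys_icache_invalidate(region, len);

        ((jit_fn)region)();                /* jump into the freshly generated code  */
    }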
The kernel doesn't get a say in what instructions a userspace program can run, other than what the CPU is designed to allow it to control. The bug is that the CPU designers forgot to allow it to control this one.
Let's say someone submits a malicious keyboard app with the bad instructions hidden in a constant pool.
Apple can't just scan for a bad byte sequence in executable pages because it could also represent legitimate constants used by the program. (not sure if this part is correct?)
If so, doesn't that make detection via static analysis infeasible unless LLVM is patched to avoid writing bad byte sequences in constant pools? Otherwise they have to risk rejecting some small number of non-malicious binaries, which might be OK, depending on the likelihood of it happening.
I believe that Rice's theorem is about computability, not about whether or not it is possible to validate which CPU instructions a program can contain.
With certain restrictions, it is possible to do this: Google Native Client [1] has a verifier which checks that the programs it executes do not jump into the middle of other instructions, forbids run-time code generation inside such programs, etc.
(What other kinds of instructions? Genuinely asking.)
I don't think Rice's Theorem applies here. As a counterexample: On a hypothetical CPU where all instructions have fixed width (e.g. 32 bits), if accessing a register requires the instruction to have, say, the 10th bit set, and all other instructions don't, and if there is no way to generate new instructions (e.g. the CPU only allows execution from ROM), then it is trivial to check whether there is any instruction in ROM that has bit 10 set.
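A sketch of what the whole "verifier" reduces to in that hypothetical (bit 10 and the fixed-width/ROM-only rules are the made-up conventions from the example above):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical CPU: fixed 32-bit instructions, execution only from ROM,
     * and the forbidden register is reachable only by instructions with
     * bit 10 set. Checking the whole ROM is a single linear pass. */
    static int rom_touches_register(const uint32_t *rom, size_t ninstr) {
        for (size_t i = 0; i < ninstr; i++)
            if (rom[i] & (1u << 10))
                return 1;   /* contains an instruction that can touch the register */
        return 0;
    }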
The next part I'm less sure how to state rigorously (I'm not in the field): in our hypothetical CPU, disallowing that instruction either leaves you Turing complete or it doesn't. In the former case, you can still compute everything a Turing machine can.
You'd have to add one extra condition to your hypothetical CPU: that it can't execute unaligned instructions. Given that, then yes, that lets you bypass Rice's theorem, even though it is indeed still Turing-complete.
But the M1 does have a way to "generate new instructions" (i.e., JIT), so that counterexample doesn't hold for it.
Yes, indeed, I should have stated "cannot execute unaligned instructions". Or have said 8 bit instead, then it would be immediately obvious what I mean. (You cannot jump into the middle of a byte because you cannot even address it.)
But I wanted to show how Rice's Theorem does not generally apply here. You can make up other examples: a register that needs an instruction with a length of 1000 bytes, yet the ROM only has 512 bytes of space, etc.
As for JIT, also correct (hence my condition), though that's also a property of the OS and not just the M1 (and on iOS, for example, which code is allowed to JIT is far more restricted, as was stated in the thread already).
With the way Apple allows implementation of JIT on the M1 (with their custom MAP_JIT flag and pthread_jit_write_protect_np) it is actually possible to do this analysis even with JIT code. Since it enforces W^X (i.e. pages cannot be writable and executable at the same time), it gives the OS an opportunity to inspect the code synchronously before it is rendered executable. Rosetta 2's JIT support already relies on this kind of inspection to do translation of JIT apps.
It does when running native ARM code (but not x86 code), but AFAIK nothing stops Apple from changing this to being kernel-mediated by updating libSystem in the ARM case as well. Of course I doubt they would take the performance hit just to get rid of this issue.
1) the program does not contain an instruction that touches s3_5_c15_c10_1
2) the program contains an instruction that touches s3_5_c15_c10_1, but never executes that instruction
3) the program contains an instruction that touches s3_5_c15_c10_1, and uses it
Rice's theorem means we cannot tell whether a program will touch the register at runtime (as that's a dynamic property of the program). But that's because we cannot tell case 2 from case 3. It's perfectly decidable whether a program is in case 1 (as that's a static property of the program).
Any sound static analysis must have false positives -- but those are exactly the programs in case 2. It doesn't mean we end up blocking other kinds of instructions.