
I'm claiming that if a program is compiled with LLVM, it must follow LLVM's rules. One of those rules is that a pointer must be valid in order to be dereferenced. If a program attempts to dereference an invalid pointer and segfaults, it has broken those rules* and thus exhibited undefined behavior. While undefined behavior MAY result in a segfault, it's equally valid for the program to continue running with corrupted state and wipe your hard disk in a background thread.
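
To make the stakes concrete, here's a minimal C sketch (illustrative names, not from anyone's actual code) of what the optimizer is allowed to do once an invalid dereference has happened: because `*p` executes first, LLVM may assume `p` is non-null and delete the later check.

    #include <stddef.h>

    int read_then_check(const int *p) {
        int v = *p;          /* UB if p is invalid */
        if (p == NULL)       /* may be folded away: p was dereferenced above */
            return -1;
        return v;
    }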

I'm not sure how I can connect the dots any more clearly. Like gggggp said, it's baffling to see the creator of a popular language sweep the nasal demons under the rug and pretend that certain undefined behavior is guaranteed.

Calling such segfaults "safe" or "well-defined" is setting your users up for disappointment and CVEs, because a "well-defined" result is axiomatically impossible in the presence of undefined behavior. It's subtle, and if we were talking about a Java competitor maybe I could forgive the mistake. But if you're writing a low-level language it's important to understand how this stuff works. Ironically, he spread misinformation in the very post where he accused Rust evangelists of the same.

This thread is long dead and continuing the discussion seems futile, so I'll just leave it at that.

*excluding something silly like `raise(SIGSEGV)`



Sure, I think I understand. The claim, maybe, is that it's legal for LLVM to emit code that, before every access through a pointer obtained from outside of LLVM, somehow checks whether the pointer points to a region of memory that was actually allocated outside of LLVM, and does different stuff based on the result of that check. In the face of such adversarial codegen on the part of LLVM, someone who wanted to implement this correctly on Linux might need to make sure they actually mapped the pages they wanted to use for crashing with PROT_NONE before using any pointers pointing into the crashing region. Is that right?

Do the docs actually define exactly which mechanisms external to LLVM count as allocating address ranges and which do not? It's possible that calling mmap and passing PROT_NONE does not count, for example.
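
For concreteness, a minimal sketch of the setup being described, assuming a Linux target and assuming that mmap with PROT_NONE counts as an allocation from LLVM's point of view (which is exactly the open question):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Reserve one inaccessible page; any load or store through a pointer
       into it traps with SIGSEGV at the OS level. */
    void *make_guard_page(void) {
        size_t len = (size_t)sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }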


I wouldn't call codegen adversarial. The optimizer isn't out to get you. It emits the best code it can given a certain set of assumptions. It may just seem adversarial at times because the output can behave in unintuitive ways if you break those assumptions.

I don't believe PROT_NONE suffices. The address needs to be accessible, not merely mapped. If reading through a pointer, the address must be readable. If writing through a pointer, the address must be writeable. This is why writing to a string constant is undefined behavior, even though reading would be fine.
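
The string-constant case in C terms (a toy example, not from the thread): the literal's bytes typically live in a mapping that is readable but not writeable.

    int main(void) {
        const char *s = "hello";
        char first = s[0];          /* fine: the page is readable */
        /* ((char *)s)[0] = 'H'; */ /* UB: mapped without write permission */
        return first == 'h' ? 0 : 1;
    }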

Another issue is alignment. If you read from a `*const i32` with unaligned pointer value 0x2, the optimizer is free to assume that code path is unreachable and, you guessed it, bulldoze your house. If you get a segfault from reading an `i32` from address 0x2, you've already hit UB and spun the roulette wheel.
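
The same example in C terms, with int32_t standing in for i32 (illustrative only):

    #include <stdint.h>

    /* 0x2 is not 4-byte aligned, so the optimizer may treat this path
       as unreachable; the UB happens before any trap ever fires. */
    int32_t misaligned_read(void) {
        int32_t *p = (int32_t *)(uintptr_t)0x2;
        return *p;
    }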

In theory the emitted code could check pointers for alignment and validity (in whatever platform-specific way) before accessing them, and simulate a segfault if not. Such checks would serve as optimization barriers in LLVM, and prevent these instances of UB. Of course Zig's current ReleaseSafe doesn't do this, and I think it would be silly if it did. But that's the only way you could accurately call segfaults "well-defined".
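
A sketch of the alignment half of such a checked access (the validity half would need something platform-specific like mincore on Linux; this is hypothetical, not what any compiler currently emits). It leans on the same trick as the footnote above: raising the signal explicitly is well-defined.

    #include <signal.h>
    #include <stdint.h>

    int32_t checked_load_i32(const int32_t *p) {
        if (((uintptr_t)p & (_Alignof(int32_t) - 1)) != 0)
            raise(SIGSEGV);  /* well-defined: we deliver the signal ourselves */
        return *p;           /* only reached when aligned */
    }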


I hope in the future we will get new LLVM semantics that are capable of expressing programs that use guard pages.



