
I'm claiming that if a program is compiled with LLVM, it must follow LLVM's rules. One of those rules is that a pointer must be valid in order to be dereferenced. If a program attempts to dereference an invalid pointer and segfaults, it has broken those rules* and thus exhibited undefined behavior. While undefined behavior MAY result in a segfault, it's equally valid for the program to continue running with corrupted state and wipe your hard disk in a background thread.
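
To make the stakes concrete, here's a minimal C sketch (illustrative names, not from anyone's actual code) of what the optimizer is allowed to do once an invalid dereference has happened: because `*p` executes first, LLVM may assume `p` is non-null and delete the later check.

    #include <stddef.h>

    int read_then_check(const int *p) {
        int v = *p;          /* UB if p is invalid */
        if (p == NULL)       /* may be folded away: p was dereferenced above */
            return -1;
        return v;
    }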

I'm not sure how I can connect the dots any more clearly. Like gggggp said, it's baffling to see the creator of a popular language sweep the nasal demons under the rug and pretend that certain undefined behavior is guaranteed.

Calling such segfaults "safe" or "well-defined" is setting your users up for disappointment and CVEs, because a "well-defined" result is axiomatically impossible in the presence of undefined behavior. It's subtle, and if we were talking about a Java competitor maybe I could forgive the mistake. But if you're writing a low-level language it's important to understand how this stuff works. Ironically, he spread misinformation in the very post where he accused Rust evangelists of the same.

This thread is long dead and continuing the discussion seems futile, so I'll just leave it at that.

*excluding something silly like `raise(SIGSEGV)`



Sure, I think I understand. The claim, maybe, is that it's legal for LLVM to emit code that, before every access through a pointer obtained from outside of LLVM, somehow checks whether the pointer points to a region of memory that was actually allocated outside of LLVM, and does different stuff based on the result of that check. In the face of such adversarial codegen on the part of LLVM, someone who wanted to implement this correctly on Linux might need to make sure they actually mapped the pages they wanted to use for crashing with PROT_NONE before using any pointers pointing into the crashing region. Is that right?

Do the docs actually define exactly which mechanisms external to LLVM count as allocating address ranges and which do not? It's possible that calling mmap and passing PROT_NONE does not count, for example.
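
For concreteness, a minimal sketch of the setup being described, assuming a Linux target and assuming that mmap with PROT_NONE counts as an allocation from LLVM's point of view (which is exactly the open question):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Reserve one inaccessible page; any load or store through a pointer
       into it traps with SIGSEGV at the OS level. */
    void *make_guard_page(void) {
        size_t len = (size_t)sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }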


I wouldn't call codegen adversarial. The optimizer isn't out to get you. It emits the best code it can given a certain set of assumptions. It may just seem adversarial at times because the output can behave in unintuitive ways if you break those assumptions.

I don't believe PROT_NONE suffices. The address needs to be accessible, not merely mapped. If reading through a pointer, the address must be readable. If writing through a pointer, the address must be writeable. This is why writing to a string constant is undefined behavior, even though reading would be fine.
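
The string-constant case in C terms (a toy example, not from the thread): the literal's bytes typically live in a mapping that is readable but not writeable.

    int main(void) {
        const char *s = "hello";
        char first = s[0];          /* fine: the page is readable */
        /* ((char *)s)[0] = 'H'; */ /* UB: mapped without write permission */
        return first == 'h' ? 0 : 1;
    }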

Another issue is alignment. If you read from a `*const i32` with unaligned pointer value 0x2, the optimizer is free to assume that code path is unreachable and, you guessed it, bulldoze your house. If you get a segfault from reading an `i32` from address 0x2, you've already hit UB and spun the roulette wheel.
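
The same example in C terms, with int32_t standing in for i32 (illustrative only):

    #include <stdint.h>

    /* 0x2 is not 4-byte aligned, so the optimizer may treat this path
       as unreachable; the UB happens before any trap ever fires. */
    int32_t misaligned_read(void) {
        int32_t *p = (int32_t *)(uintptr_t)0x2;
        return *p;
    }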

In theory the emitted code could check pointers for alignment and validity (in whatever platform-specific way) before accessing them, and simulate a segfault if not. Such checks would serve as optimization barriers in LLVM, and prevent these instances of UB. Of course Zig's current ReleaseSafe doesn't do this, and I think it would be silly if it did. But that's the only way you could accurately call segfaults "well-defined".
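
A sketch of the alignment half of such a checked access (the validity half would need something platform-specific like mincore on Linux; this is hypothetical, not what any compiler currently emits). It leans on the same trick as the footnote above: raising the signal explicitly is well-defined.

    #include <signal.h>
    #include <stdint.h>

    int32_t checked_load_i32(const int32_t *p) {
        if (((uintptr_t)p & (_Alignof(int32_t) - 1)) != 0)
            raise(SIGSEGV);  /* well-defined: we deliver the signal ourselves */
        return *p;           /* only reached when aligned */
    }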


I hope in the future we will get new LLVM semantics that are capable of expressing programs that use guard pages.



