
Isn't it one of the LLVM rules that pointers must be valid and have valid provenance in order to be dereferenced? If 0x2 ends up in a pointer that is dereferenced (or 0x0 in a nonnull pointer), hasn't that rule been broken? And if the rule is broken, doesn't that trigger undefined behavior?


I invite you to share a snippet from the LLVM language reference[1] that backs up your interpretation.

I will return the courtesy, with regards to my interpretation:

> An integer constant other than zero or a pointer value returned from a function not defined within LLVM may be associated with address ranges allocated through mechanisms other than those provided by LLVM. Such ranges shall not overlap with any ranges of addresses allocated by mechanisms provided by LLVM. [2]

[1]: https://llvm.org/docs/LangRef.html

[2]: https://llvm.org/docs/LangRef.html#pointer-aliasing-rules


From the same section,

- Any memory access must be done through a pointer value associated with an address range of the memory access, otherwise the behavior is undefined.

- A null pointer in the default address-space is associated with no address.

A null pointer (0x0) is associated with no address, therefore it has no address range. So if you do attempt a memory access (dereference), the behavior is undefined. QED. A naive translation to assembly would indeed segfault on a modern OS, but LLVM's optimizations are free to assume that code path is unreachable and do anything else.
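
To illustrate (a minimal Rust sketch, not Zig's or LLVM's actual output): the load through the null raw pointer is UB, so the optimizer may treat that path as unreachable instead of promising a faulting load.

    // Illustrative only: dereferencing a null raw pointer is UB, so the
    // optimizer may assume this path is never taken rather than
    // guaranteeing a faulting load.
    fn read(p: *const i32) -> i32 {
        unsafe { *p } // UB if `p` is null, dangling, or misaligned
    }

    fn main() {
        let p: *const i32 = std::ptr::null();
        println!("{}", read(p)); // a segfault is likely here, but not guaranteed
    }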

Once the program is in this state, a bug of some kind is unavoidable. I don't take issue with that - what I take issue with is your claim that this behavior is well-defined, because it definitely is not. It would be equally valid for a null dereference to corrupt your program state or wipe your hard disk.


You have already admitted that 0x1, 0x2, etc. are fine. Your remaining argument rests entirely on the incorrect premise that Zig's only option is to lower to LLVM IR using the default address space.


I don't think 0x2 is a valid pointer either. The docs say the pointer value must be "associated with address ranges allocated through mechanisms..." - to me the word "allocated" means it's the result of an allocation, pointing at usable address space. (Sorry, I know this is a purely semantic argument. Debating the meaning of words does not make for very interesting discussion.)

In Rust for example, dereferencing a raw pointer is unsafe - because that pointer could have a value of 0x2 - which would result in undefined behavior according to LLVM.
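
A tiny sketch of that point (illustrative only): constructing the pointer is fine; it's the dereference that LLVM gives no meaning for an address like 0x2.

    fn main() {
        let p = 0x2usize as *const i32; // fine: just an integer-to-pointer cast
        // let v = unsafe { *p };       // UB: 0x2 is not an allocated (or aligned) address
        println!("{:p}", p);            // inspecting the value is harmless
    }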

tbh I'm surprised any of this is even up for debate. If you google "is segfault undefined behavior" you'll get 100 results telling you yes, yes it is.


Are you claiming that any program that segfaults exhibits undefined behavior within LLVM semantics, even those that were not compiled by LLVM? Or within some other set of semantics shared by all programs that can segfault?


I'm claiming that if a program is compiled with LLVM, it must follow LLVM's rules. One of those rules is that a pointer must be valid in order to be dereferenced. If a program attempts to dereference an invalid pointer and segfaults, it has broken those rules* and thus exhibited undefined behavior. While undefined behavior MAY result in a segfault, it's equally valid for the program to continue running with corrupted state and wipe your hard disk in a background thread.

I'm not sure how I can connect the dots any more clearly. Like gggggp said, it's baffling to see the creator of a popular language sweep the nasal demons under the rug and pretend that certain undefined behavior has guaranteed results.

Calling such segfaults "safe" or "well-defined" is setting your users up for disappointment and CVEs, because a "well-defined" result is axiomatically impossible in the presence of undefined behavior. It's subtle, and if we were talking about a Java competitor maybe I could forgive the mistake. But if you're writing a low-level language it's important to understand how this stuff works. Ironically, he spread misinformation in the very post where he accused Rust evangelists of the same.

This thread is long dead and continuing the discussion seems futile, so I'll just leave it at that.

*excluding something silly like `raise(SIGSEGV)`


Sure, I think I understand. The claim is maybe that it's legal for LLVM to emit code that (before every pointer access to a pointer obtained from outside of LLVM) somehow checks whether the pointer points to a region of memory that was actually allocated outside of LLVM and does different stuff based on the result of that check. In the face of such adversarial codegen on the part of LLVM, if someone wanted to implement this correctly on Linux, they might need to make sure they actually mapped the pages they wanted to use for crashing with PROT_NONE before using any pointers pointing into the crashing region. Is that right?

Do the docs actually define exactly which mechanisms external to LLVM count as allocating address ranges and which do not? It's possible that calling mmap and passing PROT_NONE does not count, for example.
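
For concreteness, here's roughly what "actually mapping the crashing pages" could look like on Linux (a sketch assuming the `libc` crate; the function name is mine). Whether LLVM's aliasing rules would count such a PROT_NONE mapping as an allocation is exactly the open question.

    // Rough sketch, assuming the `libc` crate on Linux: reserve an address
    // range with PROT_NONE so that the "crashing" addresses are at least
    // backed by a real mapping. Any access through a pointer into this
    // range faults at the hardware level.
    fn reserve_crash_region(len: usize) -> *mut u8 {
        unsafe {
            let p = libc::mmap(
                std::ptr::null_mut(),
                len,
                libc::PROT_NONE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            );
            assert_ne!(p, libc::MAP_FAILED, "mmap failed");
            p as *mut u8
        }
    }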


I wouldn't call codegen adversarial. The optimizer isn't out to get you. It emits the best code it can given a certain set of assumptions. It may just seem adversarial at times because the output can behave in unintuitive ways if you break those assumptions.

I don't believe PROT_NONE suffices. The address needs to be accessible, not merely mapped. If reading through a pointer, the address must be readable. If writing through a pointer, the address must be writeable. This is why writing to a string constant is undefined behavior, even though reading would be fine.
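
The string-constant case, in Rust terms (illustrative sketch): the literal sits in a read-only mapping, so reading through the pointer is fine, while writing through it is UB and will usually fault.

    fn main() {
        let s: &'static str = "hello";
        let p = s.as_ptr() as *mut u8;
        let first = unsafe { *p };   // fine: the mapping is readable
        // unsafe { *p = b'H'; }     // UB: the mapping is not writeable
        println!("{}", first as char);
    }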

Another issue is alignment. If you read from a `*const i32` with unaligned pointer value 0x2, the optimizer is free to assume that code path is unreachable and, you guessed it, bulldoze your house. If you get a segfault from reading an `i32` from address 0x2, you've already hit UB and spun the roulette wheel.

In theory the emitted code could check pointers for alignment and validity (in whatever platform-specific way) before accessing them, and simulate a segfault if not. Such checks would serve as optimization barriers in LLVM, and prevent these instances of UB. Of course Zig's current ReleaseSafe doesn't do this, and I think it would be silly if it did. But that's the only way you could accurately call segfaults "well-defined".
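
A partial sketch of what such a checked access could look like (my own illustration, not something Zig emits): validate what can be validated up front, simulate the trap on failure, and make the access itself volatile so it acts as an optimization barrier. The "is this address actually readable" half is platform-specific and left out.

    // Partial sketch only: catches null and misaligned pointers and performs
    // the access as a volatile read so the optimizer can't reason it away.
    // A real implementation would still need a platform-specific check that
    // the address is actually readable before the read_volatile.
    fn checked_read_i32(p: *const i32) -> i32 {
        let addr = p as usize;
        if addr == 0 || addr % std::mem::align_of::<i32>() != 0 {
            eprintln!("simulated segfault at {:#x}", addr);
            std::process::abort();
        }
        unsafe { std::ptr::read_volatile(p) }
    }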


I hope in the future we will get new LLVM semantics that are capable of expressing programs that use guard pages.



