* Your project is large enough that you are likely using an unsupported libc function somewhere.
* Your project is small enough that you would benefit from just implementing a new kernel yourself.
I am biased because I avoid the C standard library even on the CPU, but this seems like a technology that raises the floor not the ceiling of what is possible.
> ... this seems like a technology that raises the floor not the ceiling of what is possible.
The root reason this project exists is to show that GPU programming is not synonymous with CUDA (or the other offloading languages).
It's nominally to help people run existing code on GPUs. Disregarding that use case, it shows that GPUs can actually do things like fprintf or open sockets. This is obvious to the people implementing these stacks but seems largely missed by application developers. Lots of people think GPUs can only do floating point math.
Especially on an APU, where the GPU units and the CPU cores can hammer on the same memory, it is a travesty to persist with the "offloading to accelerator" model. Raw C++ isn't an especially sensible language to program GPUs in but it's workable and I think it's better than CUDA.
>Disregarding that use case, it shows that GPUs can actually do things like fprintf or open sockets.
Can you elaborate on this? My mental model of a GPU is basically a huge vector coprocessor. How would things like printf or sockets work directly from the GPU when they require syscalls to trap into the OS kernel? Given that the kernel code is running on the CPU, that seems to imply there needs to be a handover at some point. Or conversely, even if there were unified memory and the GPU could directly address memory-mapped peripherals, you'd basically need to reimplement drivers, wouldn't you?
It's mostly terminology and conventions. On the standard system setup, the Linux kernel, running in a special processor mode, does these things. Linux userspace asks the kernel to do stuff using syscall and memory which both kernel and userspace can access, e.g. an io_uring registration followed by writing packets into that memory.
What the GPU has is read/write access to memory that the CPU can also access. And network peripherals etc. You can do things like alternately compare-and-swap on the same page from x64 threads and amdgpu kernels and it works, possibly not quickly on some systems. That's also all that the x64 CPU threads have though, modulo the magic syscall instruction to ask the kernel to do stuff.
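To make the compare-and-swap claim concrete, here's a minimal, hedged CPU-side sketch. The mapping of the shared page (e.g. via hipHostMalloc or the HSA memory APIs) is assumed rather than shown, and the names are made up; the GPU side would issue the equivalent atomic on the same cache line with its own instructions.

```c
// CPU-side half of a shared-page handshake. In practice `owner` lives in a
// page mapped into both the x64 process and the GPU, not in static storage.
#include <stdatomic.h>
#include <stdbool.h>

enum { OWNED_BY_CPU = 0, OWNED_BY_GPU = 1 };

static _Atomic unsigned owner = OWNED_BY_CPU;

// Returns true if the CPU managed to take the line back from the GPU.
bool cpu_try_acquire(void) {
  unsigned expected = OWNED_BY_GPU;
  return atomic_compare_exchange_strong(&owner, &expected, OWNED_BY_CPU);
}
```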
People sometimes get quite cross at my claim that the GPU can do fprintf. Cos actually all it can do is write numbers into shared memory or raise interrupts such that the effect of fprintf is observed. But that's also all the userspace x64 threads do, and this is all libc anyway, so I don't see what people are so cross about. You're writing C, you call `fprintf(stderr, "Got to L42\n");` or whatever, and you see the message on the console.
If fprintf compiles into a load of varargs mangling with a fwrite underneath, and the varargs stuff runs on the GPU silicon and the fwrite goes through a staging buffer before some kernel thread deals with it, that seems fine.
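As a hedged illustration of that split (not the upstream implementation; `gpu_rpc_write_fd` is a hypothetical hook standing in for the staging-buffer machinery):

```c
// All the varargs/formatting work runs on the GPU; only the finished bytes
// cross over to whichever thread performs the real write.
#include <stdarg.h>
#include <stdio.h>

// Hypothetical: copies len bytes into a buffer shared with the host, which
// issues the actual write(2) on the GPU's behalf.
extern void gpu_rpc_write_fd(int fd, const char *buf, size_t len);

int gpu_fprintf(int fd, const char *fmt, ...) {
  char buf[256];                      // formatting happens entirely on the GPU
  va_list ap;
  va_start(ap, fmt);
  int n = vsnprintf(buf, sizeof buf, fmt, ap);
  va_end(ap);
  if (n > 0) {
    size_t len = (size_t)n < sizeof buf ? (size_t)n : sizeof buf - 1;
    gpu_rpc_write_fd(fd, buf, len);   // the fwrite-ish part goes remote
  }
  return n;
}
```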
I'm pretty sure you could write to an nvme drive directly from the gpu, no talking to the host kernel at all, at which point you've arguably implemented (part of?) a driver for it. You can definitely write to network cards from them, without using any of this machinery.
We don't actually allow a GPU to directly fprintf, because the GPU can't syscall. Only userspace can do that. You can have userspace keep polling and then do it on behalf of the GPU, but that's not the GPU doing it.
The GPU could do the equivalent of fprintf, if the peripherals concerned used only memory-mapped I/O and the IOMMU were configured to allow the GPU to access those peripherals directly, without any involvement from the OS kernel that runs on the CPU.
This is the same as on the CPU, where the kernel can allow a user process to access a peripheral directly, without using system calls, by mapping that peripheral into the memory space of the user process.
In both cases the peripheral must be assigned exclusively to the GPU or the user process. What is lost by not using system calls is the ability to share the peripheral between multiple processes, but the performance for the exclusive user of the peripheral can be considerably increased. Of course, the complexity of the user process or GPU code is also increased, because it must include the equivalent of the kernel device driver for that peripheral.
At some point I was looking into using io_uring for something like this. The uring interface just works off of `mmap()` memory, which can be registered with the GPU's MMU. There's a submission polling setting, which means that the GPU can simply write to the pointer and the kernel will eventually pick up the write syscall associated with it. That would allow you to use `snprintf` locally into a buffer and then block on its completion. The issue is that the kernel thread goes to sleep after some time, so you'd still need a syscall from the GPU to wake it up. AMD GPUs actually support software level interrupts which could be routed to a syscall, but I didn't venture too deep down that rabbit hole.
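For reference, a hedged host-only sketch of that SQPOLL setup with liburing; the GPU doesn't appear here, it just shows which steps are plain memory writes into the mmap'd ring versus which still need a syscall. SQPOLL needs a reasonably recent kernel (and elevated privileges on older ones).

```c
// Submission-queue-polling io_uring: once set up, submitting work is mostly
// writes into the mmap'd SQ, which is the kind of memory a GPU could in
// principle be given access to. sq_thread_idle is why the kernel polling
// thread eventually sleeps and needs an io_uring_enter() wakeup.
#include <liburing.h>
#include <string.h>
#include <unistd.h>

int main(void) {
  struct io_uring ring;
  struct io_uring_params p;
  memset(&p, 0, sizeof(p));
  p.flags = IORING_SETUP_SQPOLL;   // kernel thread polls the SQ for us
  p.sq_thread_idle = 2000;         // ...but goes to sleep after 2s of idleness

  if (io_uring_queue_init_params(8, &ring, &p) < 0)
    return 1;

  const char msg[] = "hello from the submission queue\n";
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  io_uring_prep_write(sqe, STDERR_FILENO, msg, sizeof(msg) - 1, 0);

  // With SQPOLL this is usually just a tail-pointer update, no syscall --
  // which is what would make it reachable from a GPU thread.
  io_uring_submit(&ring);

  struct io_uring_cqe *cqe;
  io_uring_wait_cqe(&ring, &cqe);  // the GPU would poll the CQ ring instead
  io_uring_cqe_seen(&ring, cqe);
  io_uring_queue_exit(&ring);
  return 0;
}
```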
> Lots of people think GPUs can only do floating point math.
IIRC, every Raspberry Pi is brought up by the GPU: it sets up the system before the CPU is brought out of reset and the bootloader goes looking for the OS.
> it is a travesty to persist with the "offloading to accelerator" model.
Operating systems would need to support heterogeneous processors running programs with different ISAs accessing the same pools of memory. I'd LOVE to see that. It'd be extremely convenient to have first-class processes running on the GPU MIMD cores.
I'm not sure there is much research done in that space. I believe IBM mainframe OSs have something like that because programmers are exposed to the various hardware assists that run as coprocessors sharing the main memory with the OS and applications.
Interesting - it resembles a network of heterogeneous systems that can share a memory space used primarily for explicit data exchange. Not quite what I was imagining, but probably much simpler to implement than a Unix where the kernel can see processes running on different ISAs on a shared memory space.
I guess hardware availability is an issue, as there aren't many computers with, say, an ARM, a RISC-V, an x86, and an AMD iGPU sharing a common memory pool.
OTOH, there are many where a 32-bit ARM shares the memory pool with 64-bit cores. Usually the big cores run applications while the small ARM does housekeeping or other low-latency tasks.
> Not quite what I was imagining, but probably much simpler to implement than a Unix where the kernel can see processes running on different ISAs on a shared memory space.
Indeed. The other argument is that treating the computer as a distributed system can make it scale better to say hundreds of cores compared to a lock-based SMP system.
Until GPGPUs, there was no reason to build a machine with multiple CPUs of different architectures except to run different OSs on them (such as the Macs, Suns and Unisys mainframes with x86 boards for running Windows side-by-side with a more civilized OS). With GPGPUs you have machines with a set of processors that are good at many things but not great at SIMD, and one that's awesome at SIMD but sucks at most other things.
And, as I mentioned before, there are lots of ARM machines with 64-bit and ultra-low-power 32-bit cores sharing the same memory map. Also, even x86 variants with different ISA extensions can be treated as different architectures by the OS - Intel had to limit the fast cores of its early asymmetric parts because the low-power cores couldn't do AVX512 and OSs would not support migrating a process to the right core on an invalid instruction fault.
If the OS supports it, you can make programs that start threads on CPUs and GPUs and let those communicate. You run the SIMD-ish functions on the GPUs and the non-SIMD-heavy functions on the CPU cores.
I have a strong suspicion GPUs aren't as bad at general-purpose stuff as we perceive and we underutilize them because it's inconvenient to shuttle data over an architectural wall that's not really there in iGPUs.
Maybe it doesn't make sense, but it'd be worth looking into just to know where the borders of the problem lie.
Nah, they're pretty bad. They don't speculate or prefetch nearly as well as CPUs, and most code kind of relies on that to be fast. If you are programming for a GPU and you want to go fast you generally have to work quite hard for it.
> The root reason this project exists is to show that GPU programming is not synonymous with CUDA (or the other offloading languages).
1. The ability to use a particular library does not reflect much on which languages can be used.
2. Once you have PTX as a backend target for a compiler, obviously you can use all sorts of languages on the frontend - which NVIDIA's drivers and libraries won't even know about. Or you can just use PTX as your language - making your point that GPU programming is not synonymous with CUDA C++.
> It's nominally to help people run existing code on GPUs.
I'm worried you might be right. But - we should really not encourage people to run existing CPU-side code on GPUs, that's rarely (or maybe never?) a good idea.
> Raw C++ isn't an especially sensible language to program GPUs in but it's workable and I think it's better than CUDA.
CUDA is an execution ecosystem. The programming language for writing kernel code is "CUDA C++", which _is_ C++, plus a few builtin functions ... or maybe I'm misunderstanding this sentence.
GPU offloading languages - cuda, openmp etc - work something like:
1. Split the single source into host parts and gpu parts
2. Optionally mark up some parts as "kernels", i.e. have entry points
3. Compile them separately, maybe for many architectures
4. Emit a bunch of metadata for how they're related
5. Embed the GPU code in marked up sections of the host executable
6. Embed some startup code to find GPUs into the x64 parts
7. At runtime, go crawling around the elf section launching kernels
This particular library (which happens to be libc) is written in C++, compiled with ffreestanding target=amdgpu, to LLVM bitcode. If you build a test, it compiles to an amdgpu elf file - no x64 code in it, no special metadata, no elf-in-elf structure. The entry point is called _start. There's a small "loader" program which initialises hsa (or cuda) and passes it the address of _start.
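Concretely, a test in this model can be nothing more than ordinary C with a main; the following is a hedged sketch rather than a file from the repo, and the loader-side details (HSA/CUDA initialisation, finding _start) are assumed rather than shown.

```c
// No kernel annotations, no host code, no ifdefs. Compiled freestanding for
// amdgpu this becomes a plain GPU ELF whose entry point the small loader
// mentioned above knows how to run.
#include <stdio.h>

int main(void) {
  fprintf(stderr, "hello from the gpu\n"); // forwarded to the host to print
  return 0;
}
```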
I'm not convinced by the clever convenience cut-up-and-paste-together style embraced by cuda or openmp. This approach brings the lack of magic to the forefront. It also means we can add it to openmp etc when the reviews go through so users of that suddenly find fopen works.
CUDA C++ _can_ work like that. But I would say that these are mostly kiddie wheels for convenience. And because, in GPU programming, performance is king, most (?) kernel developers are likely to eventually need to drop those wheels. And then:
* No single source (although some headers might be shared)
* Kernels are compiled and linked at runtime, for the platform you're on, but also, in the general case, with extra definitions not known a priori (which differ across inputs and over the course of running your program), and which have a massive effect on the code - the runtime-compilation flow is sketched below.
* You may or may not use some kind of compiled-kernel caching mechanism, but you certainly don't have all possible combinations of targets and definitions available, since that would be millions of compiled kernels.
It should also be mentioned that OpenCL never included the kiddie wheels to begin with; although I have to admit it makes it less convenient to start working with.
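To make the runtime-compilation point concrete, here's a hedged sketch using NVRTC (hipRTC is analogous); error handling is omitted, and the kernel string and the -DTILE_SIZE definition are stand-ins for the values that only become known at runtime.

```c
// Compile a kernel at runtime with NVRTC. The -D option is the "definition
// not known a priori" - in real code it comes from the input you were handed.
#include <nvrtc.h>
#include <stdio.h>
#include <stdlib.h>

static const char *kernel_src =
    "extern \"C\" __global__ void scale(float *x) {\n"
    "  x[threadIdx.x + blockIdx.x * blockDim.x] *= TILE_SIZE;\n"
    "}\n";

int main(void) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kernel_src, "scale.cu", 0, NULL, NULL);

  // Chosen at runtime, e.g. from the shape of this particular input.
  const char *opts[] = {"-DTILE_SIZE=32", "--gpu-architecture=compute_80"};
  nvrtcResult res = nvrtcCompileProgram(prog, 2, opts);

  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  char *ptx = malloc(ptx_size);
  nvrtcGetPTX(prog, ptx);  // hand this to the driver API (cuModuleLoadData) to launch
  printf("compiled %zu bytes of PTX (status %d)\n", ptx_size, (int)res);

  nvrtcDestroyProgram(&prog);
  free(ptx);
  return 0;
}
```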
that's clearly not a bad thing. however encouraging people to run mutating, procedural code with explicit loops and aliasing maybe isn't the right path to get there. particularly if you just drag forward all the weird old baggage with libc and its horrible string conventions.
I think any programming environment that treats a gpu as a really slow serial cpu isn't really what you want(?)
What if it encourages people to write parallel and functional code on CPUs? That'd be a good thing. Influence works both ways.
The bigger problem is that GPUs have various platform features (shared memory, explicit cache residency and invalidation management) that CPUs sadly don't yet. Sure, you could expose these facilities via compiler intrinsics, but then you end up with code that might be syntactically valid C but is alien both to CPUs and to human minds.
On the contrary I would love that. The best case scenario in my mind is being able to express the native paradigms of all relevant platforms while writing a single piece of code that can then be compiled for any number of backends and dynamically retargeted between them at runtime. It would make debugging and just about everything else SO MUCH EASIER.
The equivalent of being able to compile some subset of functions for both ARM and x86 and then being able to dynamically dispatch to either version at runtime, except replace ARM with a list of all the GPU ISAs that you care about.
One thing this gives you is syscall on the gpu. Functions like sprintf are just blobs of userspace code, but others like fopen require support from the operating system (or whatever else the hardware needs you to do). That plumbing was decently annoying to write for the gpu.
These aren't gpu kernels. They're functions to call from kernels.
i wish people in our industry would stop (forever, completely, absolutely) using metaphors/allusions. it's a complete disservice to anyone that isn't in on the trick. it doesn't give you syscalls. that's impossible because there's no sys/os on a gpu and your actual os does not (necessarily) have any way to peer into the address space/scheduler/etc of a gpu core.
what it gives you is something that's working really really hard to pretend to be a syscall:
> Traditionally, the C library abstracts over several functions that interface with the platform’s operating system through system calls. The GPU however does not provide an operating system that can handle target dependent operations. Instead, we implemented remote procedure calls to interface with the host’s operating system while executing on a GPU.
Well, I called it syscall because it's a void function of 8 u64 arguments which your code stumbles into, gets suspended, then restored with new values for those integers. That it's a function instead of an instruction doesn't change the semantics. My favourite of the uses of that is to pass six of those integers to the x64 syscall operation.
This isn't misnaming. It's a branch into a trampoline that messes about with shared memory to give the effect of the x64 syscall you wanted, or some other thing that you'd rather do on the cpu.
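A hedged sketch of the host half of that mechanism, with a single-slot mailbox; the real protocol is per-port, per-warp and far more careful about memory ordering, so treat the names and layout as illustrative only.

```c
// Host-side cartoon: the GPU drops eight integers into shared memory, a CPU
// thread notices, runs the real x64 syscall with them, and writes the result
// back. Field names and the single slot are illustrative, not the upstream rpc.
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

struct mailbox {
  _Atomic uint32_t state;  // 0 = idle, 1 = request ready, 2 = reply ready
  uint64_t regs[8];        // GPU writes the request here, host writes results back
};

// Runs on a dedicated CPU thread; m points into memory both sides can access.
void serve(struct mailbox *m) {
  for (;;) {
    while (atomic_load_explicit(&m->state, memory_order_acquire) != 1)
      ;  // poll; a real implementation would back off or sleep
    // The request is just a syscall number and its arguments, taken straight
    // from the shared registers.
    m->regs[0] = (uint64_t)syscall((long)m->regs[0], m->regs[1], m->regs[2],
                                   m->regs[3], m->regs[4], m->regs[5],
                                   m->regs[6]);
    atomic_store_explicit(&m->state, 2, memory_order_release);
  }
}
```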
There's a gpu thing called trap which is closer in behaviour to what you're thinking of but it's really annoying to work with.
Side note, RPC has a terrible rep for introducing failure modes into APIs, but that's completely missing here because pcie either works or your machine is gonna have to reboot. There are no errors on the interface that can be handled by the application.
> Well, I called it syscall because it's a void function of 8 u64 arguments which your code stumbles into, gets suspended, then restored with new values for those integers
I'll put it really simply: is there a difference (in perf, semantics, whatever) between using these "syscalls" to implement fopen on the GPU and using a syscall to implement fopen on the CPU? Note that's a rhetorical question because we both already know that the answer is yes. So again you're just playing sleight of hand in calling them syscalls, and I'll emphasize: this is a sleight of hand that the dev himself doesn't play (so why would I take your word over his).
Wonderfully you don't need to trust my words, you've got my code :)
If semantics are different, that's a bug/todo. It'll have worse latency than a CPU thread making the same kernel request. Throughput shouldn't be way off. The GPU writes some integers to memory that the CPU will need to read, and then write other integers, and then load those again. Plus whatever the x64 syscall itself does. That's a bunch of cache line invalidation and reads. It's not as fast as if the hardware guys were on board with the strategy but I'm optimistic it can be useful today and thus help justify changing the hardware/driver stack.
The whole point of libc is to paper over the syscall interface. If you start from musl, "syscall" can be a table of function pointers or asm. Glibc is more obstructive. This libc open codes a bunch of things, with an rpc.h file dealing with synchronising memcpy of arguments to/from threads running on the CPU which get to call into the Linux kernel directly. It's mainly carefully placed atomic operations to keep the data accesses well defined.
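The GPU-facing half has roughly this shape; this is a sketch written as plain C with C11 atomics, whereas the real thing uses the GPU's own atomics and services a whole warp per port.

```c
// Device-side cartoon matching the host loop sketched earlier: publish the
// request, spin until the host has run the real syscall, read back the result.
// Single-thread version; the subtle part mentioned below is making it behave
// under arbitrary warp divergence.
#include <stdatomic.h>
#include <stdint.h>

struct mailbox {
  _Atomic uint32_t state;  // 0 = idle, 1 = request ready, 2 = reply ready
  uint64_t regs[8];
};

uint64_t gpu_syscall(struct mailbox *m, uint64_t nr, uint64_t a0, uint64_t a1,
                     uint64_t a2, uint64_t a3, uint64_t a4, uint64_t a5) {
  m->regs[0] = nr;
  m->regs[1] = a0; m->regs[2] = a1; m->regs[3] = a2;
  m->regs[4] = a3; m->regs[5] = a4; m->regs[6] = a5;
  atomic_store_explicit(&m->state, 1, memory_order_release);  // hand off to host
  while (atomic_load_explicit(&m->state, memory_order_acquire) != 2)
    ;  // "suspended" from the program's point of view
  uint64_t ret = m->regs[0];
  atomic_store_explicit(&m->state, 0, memory_order_relaxed);  // slot free again
  return ret;
}
```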
There's also nothing in here which random GPU devs can't build themselves. The header files are (now) self contained if people would like to use the same mechanism for other functionality and don't want to handroll the data structure. The most subtle part is getting this to work correctly under arbitrary warp divergence on volta. It should be an out of the box thing under openmp early next year too.
The RPC implementation in LLVM is an adaptation of Jon's original state machine (see https://github.com/JonChesterfield/hostrpc). It looks very different at this point, but we collaborated on the initial design before I fleshed out everything else. Syscall or not is a bit of a semantic argument, but I lean more towards syscall 'inspired'.
The syscall layer this runs on was written at https://github.com/JonChesterfield/hostrpc, 800 commits from May 2020 until Jan 2023. I deliberately wrote that in the open, false paths and mistakes and all. Took ages for a variety of reasons, not least that this was my side project.
You'll find the upstream of that scattered across the commits to libc, mostly authored by Joseph (log shows 300 for him, of which I reviewed 40, and 25 for me). You won't find the phone calls and offline design discussions. You can find the tricky volta solution at https://reviews.llvm.org/D159276 and the initial patch to llvm at https://reviews.llvm.org/D145913.
GPU libc is definitely Joseph's baby, not mine, and this wouldn't be in trunk if he hadn't stubbornly fought through the headwinds to get it there. I'm excited to see it generating some discussion on here.
But yeah, I'd say the syscall implementation we're discussing here has my name adequately written on it to describe it as "my code".
Why does a perf difference factor into it? There is no requirement for a syscall to be this fast or else it isn't a syscall. If you have a hot loop you shouldn't be putting a syscall in it, not even on the CPU.
It's a matter of perspective. If you think of the GPU as a separate computer, you're right. If you think of it as a coprocessor, then the use of RPC is just an implementation detail of the system call mechanism, not a semantically different thing.
When an old school 486SX delegates a floating point instruction to a physically separate 487DX coprocessor, is it executing an instruction or doing an RPC? If RPC, does the same instruction start being a real instruction when you replace your 486SX with a 486DX, with an integrated GPU? The program can't tell the difference!
A 486SX never delegates floating point instructions; the 487 is a full 486DX that disables the SX and fully takes over. You're thinking of the 386 and older.
> It's a matter of perspective. If you think of the GPU as a separate computer, you're right.
this perspective is a function of exactly one thing: do you care about the performance of your program? if not then sure indulge in whatever abstract perspective you want ("it's magic, i just press buttons and the lights blink"). but if you don't care about perf then why are you using a GPU at all...? so for people that aren't just randomly running code on a GPU (for shits and giggles), the distinction is very significant between "syscall" and syscall.
people who say these things don't program GPUs for a living. there are no abstractions unless you don't care about your program's performance (in which case why are you using a GPU at all).
The "proper syscall" isn't a fast thing either. The context switch blows out your caches. Part of why I like the name syscall is it's an indication to not put it on the fast path.
The implementation behind this puts a lot of emphasis on performance, though the protocol was heavily simplified in upstreaming. Running on pcie instead of the APU systems makes things rather laggy too. Design is roughly a mashup of io_uring and occam, made much more annoying by the GPU scheduler constraints.
The two authors of this thing probably count as people who program GPUs for a living for what it's worth.
Not everything in every program is performance critical. A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs. That's as much BS on GPU as it is on CPU. In CPU land, we moved past this sophomoric attitude decades ago. The GPU world might catch up one day.
Are you planning on putting fopen() in an inner loop or something? LOL
The whole reason CUDA/GPUs are fast is that they explicitly don’t match the architecture of CPUs. The truly sophomoric attitude is that all compute devices should work like CPUs. The point of CUDA/GPUs is to provide a different set of abstractions than CPUs that enable much higher performance for certain problems. Forcing your GPU to execute CPU-like code is a bad abstraction.
Your comment about putting fopen in an inner loop really betrays that. Every thread in your GPU kernel is going to have to wait for your libc call. You’re really confused if you’re talking about hot loops in a GPU kernel.
> A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs.
You're talking to the wrong people; this is definitely not true in general.
genuinely asking: where else should ML engineers focus their time, if not on looking at datapath bottlenecks in either kernel execution or the networking stack?
The point is that you should focus on the bottlenecks, not on making every random piece of code "as fast as possible". And that sometimes other things (maintainability, comprehensibility, debuggability) are more important than maximum possible performance, even on the GPU.
That's fair, but I didn't understand OP to be claiming above that "cudaheads" aren't looking at their performance bottlenecks before driving work, just that they're looking at the problem incorrectly (and eg: maybe should prioritize redesigns over squeezing perf out of flawed approaches.)
> A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs
I don't know what a "cudahead" is but if you're gonna build up a strawman just to chop it down have at it. Doesn't change anything about my point - these aren't syscalls because there's no sys. I mean the dev here literally spells it out correctly so I don't understand why there's any debate.
I've never understood why people say you "can't" do this or that on GPU. A GPU is made of SMs, and each SM is just a CPU with very wide SIMD pipes and very good hyperthreading. You can take one thread of a warp in a SM and do exactly the same things a CPU would do. Would you get 1/32 potential performance? Sure. But so what? Years ago, we did plenty of useful work with less than 1/32 of a modest CPU, and we can again.
One of the more annoying parts of the Nvidia experience is PTX. I know perfectly well that your CPU/SM/whatever has a program counter. Let me manipulate it directly!
Why would that matter? Adding extra CPU cores to the GPU sounds like a much dumber idea to me. You'd be wasting silicon on something that is meant to be used infrequently.
Is it really that difficult to imagine 1% of your workload doing some sort of management code, except since it is running on your GPU, you can now freely intersperse those 1% every 50 microseconds without too much overhead?
You put “can’t” in quotes, so I guess you are quoting somebody, but I don’t see where the quote is from, so I’m not sure what they actually meant.
But I suspect they are using “can’t” informally. Like: You can’t run a drag race in reverse. Ok, you technically could, but it would be a silly thing to do.