Hello World on the GPU (2019)

JonChesterfield · on Nov 16, 2023

As of this year (ish), `int main() {puts("hello, world\n");}` stands a decent chance of running on a GPU and doing the right thing if you compile it with clang. Terminal application style. Should be able to spell it printf shortly, variadic functions turn out to be a bit of a mess.

pjmlp · on Nov 16, 2023

CUDA already does printf, and C++20 support, minus modules.

JonChesterfield · on Nov 17, 2023

C++20 would be news to me. Do you have a reference? The closest I can find is https://github.com/NVIDIA/cccl which seems to be atomic and bits of algorithm. E.g. can you point to unordered_map that works on the target?

I think some pieces of libc++ work but don't know of any testing or documentation effort to track what parts, nor of any explicit handling in the source tree.

pjmlp · on Nov 17, 2023

Yes, it is right there on the documentation.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

Existing restrictions,

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

JonChesterfield · on Nov 17, 2023

Well, the docs say C++ support. There's a reference to <type_traits> and a lot of limitations on what you can do with lambda. I don't see supporting evidence that the rest of libc++ exists. I think what they mean is some syntax from C++20 is implemented, but the C++ library is not.

By "run C on the GPU" I'm thinking of taking programs and compiling them for the GPU. The lua interpreter, sqlite, stuff like that. I'm personally interested in running llvm on one. Not taking existing code, deleting almost all uses of libc or libc++ from it, then strategically annotating it with host/device/global noise and partitioning it into host and target programs with explicit data transfer.

That is, I don't consider "you can port it to cuda with some modern C++ syntax" to be "you can run C++", what with them being different languages and all. So it doesn't look like Nvidia have beaten us to shipping this yet.

Thank you for the reference.

Edit: a better link might be https://nvidia.github.io/libcudacxx/standard_api.html which shows an effort to port libc++, but it's early days for it. No STL data structures in there.

SubjectToChange · on Nov 17, 2023

IIRC, CUDA Toolkit 12.0 added partial support for C++20 in nvcc and nvrtc.

krackers · on Nov 17, 2023

What does that do under the hood though? What does it mean to execute puts from a GPU?

JonChesterfield · on Nov 17, 2023

Libc on x64 is roughly a bunch of userspace code over syscall which traps into the kernel. Looks like a function that takes six integer registers and writes results to some of those same registers.

Libc on nvptx or amdgpu is a bunch of userspace code over syscall, which is a function that takes eight integers per lane on the GPU. That "syscall" copies those integers to the x64/host/other architecture. You'll find it in a header called rpc.h, the same code compiled on host or GPU. Sometime later a thread on the host reads those integers, does whatever they asked for (e.g. call the host syscall on the next six integers), possibly copies values back.

Puts probably copies the string to the host 7*8 bytes at a time, reassembles it on the host, then passes it to the host implementation of puts. We should be able to kill the copy on some architectures. Some other functions run wholly on the GPU, e.g. sprintf shouldn't talk to the host, but fprintf will need to.

The GPU libc is fun from a design perspective because it can run code on either side of that communication channel as we see fit. E.g. printf floating point handling seems prone to large numbers of registers needed on the GPU at the moment so we may move some work to the host to make the register usage better (higher occupancy).

KeplerBoy · on Nov 16, 2023

Do you happen to have a link to these developments?

JonChesterfield · on Nov 16, 2023

Documentation is lagging reality a bit, we'll probably fix that around the next llvm release. Some information is at https://libc.llvm.org/gpu/using.html

That GPU libc is mostly intended to bring things like fopen to openmp or cuda, but it turns out GPUs are totally usable as bare metal embedded targets. You can read/write to "host" memory, on that and a thread running on the host you can implement a syscall equivalent (e.g. https://dl.acm.org/doi/10.1145/3458744.3473357), and once you have syscall the doors are wide open. I particularly like mmap from GPU kernels.

keldaris · on Nov 17, 2023

Is there a way to directly use these developments to already write a reasonable subset of C/C++ for simpler usecases (basically doing some compute and showing the results on screen by just manipulating pixels in a buffer like you would with a fragment/pixel shader) in a way that's portable (across the three major desktop platforms, at least) without dealing with cumbersome non-portable APIs like OpenGL, OpenCL, DirectX, Metal or CUDA? This doesn't require anything close to full libc functionality (let alone anything like the STL), but would greatly improve the ergonomics for a lot of developers.

JonChesterfield · on Nov 17, 2023

I'll describe what we've got, but fair warning that I don't know how the write pixels to the screen stuff works on GPUs. There are some instructions with weird names that I assume make sense in that context. Presumably one allocates memory and writes to it in some fashion.

LLVM libc is picking up capability over time, implemented similarly to the non-gpu architectures. The same tests run on x64 or the GPU, printing to stdout as they go. Hopefully standing up libc++ on top will work smoothly. It's encouraging that I sometimes struggle to remember whether it's currently running on the host or the GPU.

The datastructure that libc uses to have x64 call a function on amdgpu, or to have amdgpu call a function on x64, is mostly a blob of shared memory and careful atomic operations. That was originally general purpose and lived on a prototypey GitHub. Its currently specialised to libc. It should end up in an under-debate llvm/offload project which will make it easily reusable again.

This isn't quite decoupled from vendor stuff. The GPU driver needs to be running in the kernel somewhere. On nvptx, we make a couple of calls into libcuda to launch main(). On amdgpu, it's a couple of calls into libhsa. I did have an opencl loader implementation as well but that has probably rotted, intel seems to be on that stack but isn't in llvm upstream.

A few GPU projects have noticed that implementing a cuda layer and a spirv layer and a hsa or hip layer and whatever others is quite annoying. Possibly all GPU projects have noticed that. We may get an llvm/offload library that successfully abstracts over those which would let people allocate memory, launch kernels, use arbitrary libc stuff and so forth running against that library.

That's all from the compute perspective. It's possible I should look up what sending numbers over HDMI actually is. I believe the GPU is happy interleaving compute and graphics kernels and suspect they're very similar things in the implementation.

pjmlp · on Nov 17, 2023

CUDA allows for straight C++ for quite some time, that is how renderers like nanite are written.

https://docs.nvidia.com/cuda/cuda-c-std/index.html

"C++ Standard Parallelism"

https://www.youtube.com/watch?v=nwrgLH5yAlM

Or if you prefer more vendor neutral,

https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-...

Currently with C++17 support.

SubjectToChange · on Nov 17, 2023

I’m cautiously optimistic for SYCL. The absurd level of abstraction is a bit alarming, but single source performance portability would be a godsend for library authors.

pjmlp · on Nov 17, 2023

This is one area where I imagine C++ wannabe replacements like Rust having a very hard time taking over.

It took almost 20 years to move from GPU Assembly (DX 9 timeframe), shading languages, to regular C, C++, Fortran and Python JITs.

There are some efforts with Java, .NET, Julia, Haskell, Chapel, Futhark, however still trailing behind the big four.

Currently in terms of ecosystem, tooling and libraries, as far as I am aware, Rust is trailing those, and not yet being a presence on HPC/Graphics (Eurographics, SIGGRAPH) conferences.

SubjectToChange · on Nov 17, 2023

This is one area where I imagine C++ wannabe replacements like Rust having a very hard time taking over.

I 100% agree. Although I have a keen interest in Rust I can’t see it offering any unique value to the GPGPU or HPC space. Meanwhile C++ is gaining all sorts of support for HPC. For instance the parallel stl algorithms, mdspan, std::simd, std::blas, executors (eventually), etc. Not to mention all of the development work happening outside of the ISO standard, e.g. CUDA/ROCm(HIP)/OpenACC/OpenCL/OpenMP/SYCL/Kokkos/RAJA and who knows what else.

C++ is going to be sitting tight in compute for a long time to come.

pjmlp · on Nov 17, 2023

There is always the argument that it can help reduce the errors produced due to memory corruption.

However industry standards matter more.

SubjectToChange · on Nov 17, 2023

HPC researchers already employ some techniques to detect memory corruption, hardware flaws, floating point errors, and so on. Maybe Rust could meaningfully reduce memory errors, but if it comes at the cost of bounds checking (or any other meaningful runtime overhead) they will have absolutely zero interest.

pjmlp · on Nov 17, 2023

Chapel and Julia ongoing efforts proves otherwise, and I can tell from CERN days, not everyone uses those tools.

In any case, that means those languages are much better positioned than Rust in such ecosystem.

SubjectToChange · on Nov 17, 2023

If you’re willing to deal with 5 layers of C++ TMP, then a library like Kokkos will let you abstract over those APIs, or at least some of them. Eventually if or when SYCL is upstreamed in the llvm-project it’ll be possible to do it with clang directly.

KeplerBoy · on Nov 16, 2023

This is super interesting, thanks!

raytopia · on Nov 16, 2023

Great parody of WebGPU and other low level graphics apis.

runetech · on Nov 16, 2023

If nothing else, I am grateful for the introduction to Selah Sue (music that plays when you press, well.. the play symbol in the top animation).

Spectacular vibe! Combined with the fullscreen animation is almost reminiscent of the demo-scene. I enjoyed the rest of the actual web page much more after that.

I salute thee whoever made this. Much appreciated!

dragontamer · on Nov 16, 2023

There's a degree of GPU-style going on here, but its not OpenGL or DirectX.

  for y in 0..height {
    for x in 0..width {

      // Get target position
      let tx = x + offset;
      let ty = y;

So this code, in a language I'm not too familiar with, is clearly a GPU concept. Except, this 2-dimensional for-loop is executed in parallel on modern GPUs in the so-called pixel-shader.

A Pixel-shader is all sorts of complications in practice that deserves at least a few days of studying the rendering pipeline to understand. But the tl;dr is that a pixel-shader launches a thread (erm... a SIMD-lane? A... work-item? A shader?) per pixel, and then the device drivers do some magic to group them together.

Like, in the raw hardware, pixel0-0 is going to be rendered at the same time as pixel0-1, pixel0-2, etc. etc. And the values inside of this "for loop" are the code that runs it all.

Sure its SIMD and all kinds of complicated to fully describe what's going on here. But the bulk of GPU-programming (or at least, for pixel shaders), is recognizing the one-thread-per-pixel (erm, SIMD-lane per pixel) approach.

------------------

Anyway, I think this post is... GPU-enough. I'm not sure if this truly executes on a GPU given how the code was written. But I'd give it my stamp of approval as far as "Describing code as if it were being done on a GPU", even if they're cheating for simplicity in many spots.

The #1 most important part is that the "rasterize" routine is written in the embarrassingly parallel mindset. Every pixel "could" in theory, be processed in parallel. (Notice that no pixels have race-conditions or locks, or sequencing needed with each other).

And the #2 part is having the "sequential" CPU-code logically and seamlessly communicate with the "embarrassingly parallel" rasterize routine in a simple, logical, and readable manner. And this post absolutely accomplishes that.

Its harder to write this cleanly than it looks. But having someone show you, as per this post, how it is done helps with the learning process.

pjmlp · on Nov 16, 2023

It is a Rust application making use of wgpu, Rust's WebGPU native library.

dragontamer · on Nov 16, 2023

Nope.

Pixel shaders in WebGPU / wgpu are written in WGSL. The above 2-dimensional for-loop is _NOT_ a proper pixel shader (but it is written in a "Pixel Shader style", very familiar to any GPU programmer).

zorgmonkey · on Nov 16, 2023

The author didn't say it, but I'm pretty sure for-loop was meant to be pseudocode to help the reader understand what it does and not the actual implementation.

dragontamer · on Nov 16, 2023

I'm pretty sure this whole post is a shitpost. A well written joke, and one I enjoyed. But a shitpost nonetheless.

Upon closer inspection, the glyphs are each rendered onto the framebuffer sequentially... one-at-a-time. IE: NOT in an embarrassingly parallel manner. So the joke is starting to fall apart as you look closely.

But those kinds of details don't matter. The post is written well enough to be a good joke but no "better" than needed. (EDIT: It was written well enough to trick me in my first review of the article. But on 2nd and 3rd inspection, I'm noticing the problems, and its all in good fun to see the post degenerate into obvious satire by the end).

imtringued · on Nov 18, 2023

The rasterizer doesn't even do any rasterization. It just blends the already rasterized glyphs onto the screen.

Honestly it sounds like AI. This is a website in the shape/memory of a blogpost, not an actual blogpost.

pjmlp · on Nov 16, 2023

Because it is Rust code?!?

"...An easy tutorial in Rust"

A short visit to the authors blog clearly shows they know what they talk about.

Const-me · on Nov 16, 2023

It’s not just the language. That code is impossible to directly translate to a pixel shader because GPUs only implement fixed-function blending. Render target pixels (and depth values) are write-only in the graphics pipeline, they can be only loaded with fixed-function pieces of GPUs: blending, depth rejection, etc.

It’s technically possible to translate the code into compute shader/CUDA/OpenCL/etc., but that gonna be slow and hard to do, due to concurrency issues. You can’t just load/blend/store without a guarantee other threads won’t try to concurrently modify the same output pixel.

kimixa · on Nov 17, 2023

Tilers (mostly mobile and Apple) generally expose the ability to read & write the framebuffer value pretty easily - see things like GL_EXT_shader_framebuffer_fetch or vulkan's subpasses.

For immediate mode renderers (IE desktop cards), VK_EXT_fragment_shader_interlock seems available to correct those "concurrency" issues. DX12 ROVs seem to expose similar abilities. Though performance may be hit more than tiling architectures.

So you can certainly read-modify-write framebuffer values in pixel shaders using current hardware, which is what is needed for a fully shader-driven blending step.

hutzlibu · on Nov 16, 2023

"Graphics programming can be intimidating. It involves a fair amount of math, some low-level code, and it's often hard to debug. Nevertheless I'd like to show you how to do a simple "Hello World" on the GPU. You will see that there is in fact nothing to be afraid of."

57 created objects later

"Hm. Damn"

Well .. there is a reason it is usually "hello triangle" on GPU tutorials. Spoiler alert, GPUs ain't easy.

pjmlp · on Nov 16, 2023

3D APIs were easier back in the yearly days.

Nowadays using "legacy" APIs is relatively easy, however it requires a background knowledge on how GL became GL 4.6, DX became DX 11 and so.

Modern APIs are super low level, they are designed as GPU APIs for driver writers basically. Since they cut the fat legacy API drivers used to take care for the applications, now everyone has to deal with such complexity directly, or make use of a middleware engine instead.

kimixa · on Nov 16, 2023

What's wrong with using middleware engines? That was pretty much the expected case for people who didn't /need/ the level of control exposed by APIs like vulkan or dx12.

It's pretty much what OpenGL "drivers" were doing past the introduction of hardware shaders anyway - acting as a pretty thick middleware translating that to low-level commands, only having the user API being locked into a design from decades ago.

And considering how hard it was to get Khronos to eventually agree on vulkan in the first place (that effectively being a drop from AMD in "Mantle" then only tweaked by the committee), I'm not surprised they haven't standardized a higher-level API. So third party middleware it is.

pjmlp · on Nov 16, 2023

Nothing, middleware is exactly why most game studios hardly care about religious discussions regarding 3D APIs, as in FOSS circles.

I guess the problem is the expectation exactly by those that feel GL and DX 11 are done, and they need to directly use their replacements.

MaxBarraclough · on Nov 16, 2023

> now everyone has to deal with such complexity directly, or make use of a middleware engine instead

You don't really have to though, you can still use the higher-level older graphics APIs. It wouldn't have made much sense for Vulkan to include a high-level graphics API as well, as those APIs already exist and have mature ecosystems.

Similarly, in Windows land, you aren't forced to use D3D12, you can still use D3D11 or even D3D9.

pjmlp · on Nov 16, 2023

Kind of, many developers feel they are using legacy deprecated APIs, especially given how they are referred to in some circles.

Naturally that leads to some uncomfort.

raytopia · on Nov 16, 2023

Well if you use glBegin it's pretty easy.

glBegin(GL_TRIANGLES);

  glVertex3f( 0.0f, 1.0f, 0.0f);              
  glVertex3f(-1.0f,-1.0f, 0.0f);              
  glVertex3f( 1.0f,-1.0f, 0.0f);

glEnd();

And there you go you got a triangle.

It's great for beginners because they can see the results very fast and once they want to start having crazy graphical effects or need more performance you can move to shaders.

jacquesm · on Nov 16, 2023

The hardware was designed to display triangles, not to do 'Hello world'.

So it's not all that surprising that the one is easier than the other, in a way it is surprising that the other can be done at all. But as CPUs and GPUs converge it's quite possible that NV or another manufacturer eventually slips enough general purpose capacity onto their cards that they function as completely separate systems. And then 'Hello world' will be trivial.

hutzlibu · on Nov 16, 2023

Erm yes. A triangle is quite easy .. but here they tried a simple tutorial to actually print "Hello World" .. and surprise, it wasn't easy and in the end just stops.

raytopia · on Nov 16, 2023

It's not as easy as drawing triangles but glut which was a part of old school opengl had the function glutBitmapString which made it pretty easy to draw text in a few lines.