Lots of questions, but the main one is whether we have made any progress with these new toolchains and programming languages w/ respect to performance or robustness. And that may be unfair to ask of what is a genuinely useful tutorial.
I assume the point of these tutorials is not to show experts how to progress from the state of the art, but to show beginners how to get there. There are a few tutorials like this (build your own text editor, operating system, etc) and I think they are a great idea if done well.
The lemon parser generator is actually a delight to use if you're into that sort of thing. Paired with re2c you have a combination that rivals yacc/bison IMHO.
> I'm sad to announce that @OtterTuneAI is officially dead. Our service is shut down and we let everyone go today (1mo notice). I can't go into details of what happened but we got screwed over by a PE Postgres company on an acquisition offer.
https://x.com/andy_pavlo/status/1801687420330770841
Yes, and I've seen them in production. Once you learn the rules, you can parse them reasonably well in your head but it is still really confusing at a glance.
Not Llama but with Sonnet and O1 I wrote a bespoke android app for my company in about 8 hours of work. Once I polish it a bit (make a prettier UI), I'm pretty sure I could sell it to other companies doing our kind of work.
I am not a programmer, and I know C and Python at about a 1 day crash course level (not much at all).
However with Sonnet I was able to be handheld all the way from downloading Android Studio to a functional app written in Kotlin, which is now being used by employees on the floor.
People can keep telling themselves that LLMs are useless, or maybe just helpful for quickly spewing boilerplate code, but I would heed the warning that this tech is only going to improve and is already very seriously helping people forgo SWEs. Sears thought the internet was a cute party trick, and that obviously print catalogs were here to stay.
This is meaningless without talking about the capabilities of the app. I’ve seen examples of this before where non-programmers come up with something using an LLM that could just be a webpage with camera access and some javascript
Today, I wrote a full YouTube subtitle downloader in Dart. 52 minutes from starting to google anything about it, to full implementation and tests, custom formatting any of the 5 obscure formats it could be in to my exact whims. Full coverage of any validation errors via mock network responses.
I then wrote a web AudioWorklet for playing PCM in 3 minutes, which conformed to the same interface as my Mac/iOS/Android versions, ex. setting sample rate, feedback callback, etc. I have no idea what an AudioWorklet is.
Two days ago, I stubbed out my implementation of OpenAI's web socket based realtime API, 1400 LOC over 2 days, mostly by hand while grokking and testing the API. In 32 minutes, I had a brand spanking new batch of code, clean, event-based architecture, 86% test coverage. 1.8 KLOC with tests.
In all of these cases, the most I needed to do was drop in code files, say "nope, wrong" a couple times to Sonnet, and say "why are you violating my service contract and only providing an example solution" to o1.
Not Llama 3.1 405B specifically, I haven't gone to the trouble of running it, but things turned some sort of significant corner over the last 3 months, between o1 and Sonnet 3.5. Mistakes are rare. It's believable 405B is on that scale; IIRC it went punch for punch with the original 3.5 Sonnet.
But I find it hard to believe a Google L3, and a third of L4s (read: new hires, or people who survived 3 years), are that productive and sending code out for review at a fifth of that volume, much less on demand.
So insane-sounding? Yes.
Out there? Probably, I work for myself now. I don't have to have a complex negotiation with my boss on what I can use and how. And I only saw this starting ~2 weeks ago, with full o1 release.
Most software is not one off little utilities/scripts, greenfield small projects, etc... That's where LLMs excel, when you don't have much context and can regurgitate solutions.
It's less to do with junior/senior/etc.. and more to do with the types of problems you are tackling.
This is a 30KLOC 6 platform flutter app that's, in this user story, doing VOIP audio, running 3 audio models on-device, including in your browser. A near-replica of the Google Assistant audio pipeline, except all on-device.
It's a real system, not kindergarten "look at the React Claude Artifacts did, the button makes a POST request!"
The 1500 loc websocket / session management code it refactored and tested touches on nearly every part of the system (i.e. persisting messages, placing search requests, placing network requests to run a chat flow)
Also, it's worth saying this bit a bit louder: the "just throwing files in" I mention is key.
With that, the distinction is that the quality pattern you observed runs in reverse: with o1's thinking, and whatever Sonnet's magic is, there's a higher payoff from working with a larger codebase.
For example, here, it knew exactly what to do for the web because it already saw the patterns iOS/Android/macOS shared.
The bend I saw in the curve came from being ultra lazy one night and seeing what would happen if it just had all the darn files.
> This is a 30KLOC 6 platform flutter app […] It's a real system, not kindergarten "look at the React Claude Artifacts did, the button makes a POST request!"
This is powerful and significant but I think we need to ground ourselves on what a skilled programmer means when he talks about solving problems.
That is, honestly ask: What is the level of skill a programmer requires to build what you've described? Mostly, imo, the difficulties in building it are in platform and API details not in any fundamental engineering problem that has to be solved. What makes web and Android programming so annoying is all the abstractions and frameworks and cruft that you end up having to navigate. Once you've navigated it, you haven't really solved anything, you've just dealt with obstacles other programmers have put in your way. The solutions are mostly boilerplate-like and the code I write is glue.
I think the definition of "junior engineer" or "simple app" will be defined by what LLMs can produce and so, in a way, unfortunately, the goal posts and skill ceiling will keep shifting.
On the other hand, say we watch a presentation by the Lead Programmer at Naughty Dog, "Parallelizing the Naughty Dog Engine Using Fibers"[^0], and ask the same questions: what level of skill is required to solve the problems he's describing (solutions worth millions of dollars, because his product has to sell that much to have good ROI):
"I have a million LOC game engine for which I need to make a scheduler with no memory management for multithreaded job synchronization for the PS4."
A lot of these guys, if you've talked to them, are often frustrated that LLMs simply can't help them make headway with, or debug, these hard problems where novel hardware-constrained solutions are needed.
It's been pretty hard, but if you reduce it to "Were you using a framework, or writing one that needs to push the absolute limits of performance?"...
...I guess the first?...
...But not really?
I'm not writing GPU kernels or operating system task schedulers, but I am going to some pretty significant lengths to be running ex. local LLM, embedding model, Whisper, model for voice activity detection, model for speaker counting, syncing state with 3 web sockets. Simultaneously. In this case, Android and iOS are no valhalla of vapid-stackoverflow-copy-pasta-with-no-hardware-constraints, as you might imagine.
And the novelty is, 6 years ago, I would have targeted iOS and prayed. Now I'm on every platform at top-tier speeds. All that boring tedious scribe-like stuff that 90% of us spend 80% of our time on, is gone.
I'm not sure there's very many people at all who get to solve novel hardware-constrained problems these days, I'm quite pleased to brush shoulders with someone who brushes shoulders with them :)
Thus, smacks more of no-true-scotsman than something I can chew on. Productivity gains are productivity gains, and these are no small productivity gains in a highly demanding situation.
> Thus, smacks more of no-true-scotsman than something I can chew on.
I wasn't making a judgement about you or your work, after all I don't know you. I was commenting within the context of an app that you described for which an LLM was useful, relative to the hard problems we'll need help with if we want to advance technology (that is, make computers do more powerful things and do them faster). I have no idea if you're a true Scotsman or not.
Regardless: over the coming years we'll find out who the true Scotsmen were, as they'll be hired to do the stuff LLMs can't.
The challenging projects I've worked on are challenging not because slamming out code to meet requirements is hard (or takes long).
It's challenging because working to get a stable set of requirements requires a lot of communication with end users, stakeholders, etc. Then, predicting what they actually mean when implementing said requirements. Then, demoing the software and translating their non-technical understanding and comments back into new requirements (rinse and repeat).
If a tool can help program some of those requirements faster, as long as it meets security and functional standards, and is maintainable, it's not a big deal whether a junior dev is working with Stack Exchange or Claude, IMO. But I do want that dev to understand the code being committed, because otherwise security bugs and future maintenance headaches creep in.
I've definitely noticed the opposite on larger codebases. It's able to do magical things on smaller ones but really starts to fall apart as I scale up.
I think most software outside of the few Silicon Valleys of the world is in fact a bunch of dirty hacks put together.
I fully believe recursive application of current SOTA LLMs plus some deployment framework can replace most software engineers who work in the cornfields.
I don't understand what you guys are doing. For me sonnet is great when I'm starting with a framework or project but as soon as I start doing complicated things it's just wrong all the time. Subtly wrong, which is much worse because it looks correct, but wrong.
yt-dlp has a subtitle option.
To quote the documentation:
"--write-sub Write subtitle file
--write-auto-sub Write automatically generated subtitle
file (YouTube only)
--all-subs Download all the available subtitles of
the video
--list-subs List all available subtitles for the
video
--sub-format FORMAT Subtitle format, accepts formats
preference, for example: "srt" or
"ass/srt/best"
--sub-lang LANGS Languages of the subtitles to download
(optional) separated by commas, use
--list-subs for available language tags"
This is a key point and one of the reasons why I think LLMs will fall short of expectation. Take the saying "Code is a liability," and the fact that with LLMs, you are able to create so much more code than you normally would:
The logical conclusion is that projects will balloon with code pushing LLMs to their limit, and this massive amount is going to contain more bugs and be more costly to maintain.
Anecdotally, supposedly most devs are using some form of AI for writing code, and the software I use isn't magically getting better (I'm not seeing an increased rate of features or less buggy software).
My biggest challenges in building a new application and maintaining an existing one is lack of good unit tests, functional tests, and questionable code coverage; lack of documentation; excessively byzantine build and test environments.
Cranking out yet more code, though, is not difficult (and junior programmers are cheap). LLMs do truly produce code like a (very bad) junior programmer: when trying to make unit tests, it takes the easiest path and makes something that passes but won't catch serious regressions. Sometimes I've simply reprompted it with "Please write that code in a more proper, Pythonic way". When it comes to financial calculations around dates, date intervals, rounding, and so on, it often gets things just ever-so-slightly wrong, which makes it basically useless for financial or payroll type of applications.
It also doesn't help much with the main bulk of my (paid) work these days, which is migrating apps from some old platform like vintage C#-x86 or some vendor thing like a big pile of Google Apps Script, Jotform, Zapier, and so on into a more maintainable and testable framework that doesn't depend on subscription cloud services. So far I can't find a way to make LLMs productive at all at that - perhaps that's a good thing, since clients still see fit to pay decently for this work.
I don't understand - why does the existence of a CLI tool mean we're risking a grey goo situation if an LLM helps produce Dart code for my production Flutter app?
My guess is you're thinking I'm writing duplicative code for the hell of it, instead of just using the CLI tool - no. I can't run arbitrary binaries, at all, on at least 4 of the 6 platforms.
Apologies if it looked like I was singling out your comment. It was more that those comments brought the idea to mind that sheer code generation without skilled thought directing it may lead to unintended negative outcomes.
Something I noticed is that as the threshold for doing something like "write software to do X" decreases, the tendency for people to go search for an existing product and download it tends to zero.
There is a point where in some sense it is less effort to just write the thing yourself. This is the argument against micro-libraries as seen in NPM as well, but my point is that the threshold of complexity for "write it yourself" instead of reaching for a premade thing changes over time.
As languages, compilers, tab-complete, refactoring, and AI assistance get better and better, eventually we'll reach a point where the human race as a whole will be spitting out code at an unimaginable rate.
I agree with you that they are improving. Not being a programmer, I can't tell if the code has improved, but as a user who uses ChatGPT or Google Gemini to build scripts or TradingView indicators, I am seeing some big improvements, and many times wording the request in better detail and restricting it from going off on tangents results in working code.
I learned over 20 years ago to _never_ post your code online for programmers to critique. Never. Unless you are an absolute pro (at which point you wouldn't even be asking for review), never do it.
I'm happy to provide whatever you ask for. With utmost deference, I'm sure you didn't mean anything by it and were just rushed, but just in case...I'd just ask that you'd engage with charity[^3] and clarity[^2] :)
I'd also like to point out[^1] --- meaning, I gave it my original code, in toto. So of course you'll see ex. comments. Not sure what else contributed to your analysis, that's where some clarity could help me, help you.
[^1](https://news.ycombinator.com/item?id=42421900) "I stubbed out my implementation of OpenAI's web socket based realtime API, 1400 LOC over 2 days, mostly by hand while grokking and testing the API. In 32 minutes, I had a brand spanking new batch of code, clean, event-based architecture, 86% test coverage. 1.8 KLOC with tests."
[^3](https://news.ycombinator.com/newsguidelines.html) "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."
i went to a groq event and one of their engineers told me they were running 7 racks!! of compute per (70b?) model. that was last year so my memory could be fuzzy.
iirc, groq used to be making resnet-500? chips? the only way such an impressive setup makes any kind of sense (my guess) would be they bought a bunch of resnet chips way back when and now they are trying to square peg in round hole that sunk cost as part of a fake it till you make it phase. they certainly have enough funding to scrap it all and do better... the question is if they will (and why they haven't been able to yet)
Yes, Groq requires hundreds or thousands of chips to load an LLM because they didn't predict that LLMs would get as big as they are. The second generation chip can't come soon enough for them.
So I interacted with people at cerebras at a tradeshow and it seems like you have to have extremely advanced cooling to keep that thing working. IIRC the user agreement says "you can't turn it off or else the warranty is void". With the way their chip is designed, I would be strongly worried that the giant chip has warping issues, for example, when certain cores are dark and the thermal generation is uneven (or, if it gets shut down on accident while in the middle of inferencing an LLM). There may even be chip-to-chip variation depending on which cores got dq'd based on their on-the-spot testing.
Already through the grapevine I'm hearing that H100s and B100s have to be replaced more often.... than you'd want? I suspect people are mum about it otherwise they might lose sweetheart discounts from Nvidia. I can't imagine that Cerebras, even with the extreme engineering of their cooling system, have truly solved cooling in a way that isn't a pain in the ass (otherwise they wouldn't have the clause?), and if I were building a datacenter I would be very worried about having to do annoying and capital-intensive replacements.
I have nowhere near the knowledge required to say yes or no to your argument. My point is that the guy who wrote the article is shilling a pre-IPO company while FUDding the competitors, which makes it really surprising that it got that many upvotes.
maybe, but it shouldn't be surprising. cerebras's designs were born ~2014, pre-transformers, and the megachips were initially targeted at hpc workloads. it was definitely "solution looking for a problem" back then and is drifting into square-peg-in-round-hole territory now (see sibling comment about groq). I'm surprised they have gotten their raw perf as high as they have by now.
* Your project is large enough that you are likely using an unsupported libc function somewhere.
* Your project is small enough that you would benefit from just implementing a new kernel yourself.
I am biased because I avoid the C standard library even on the CPU, but this seems like a technology that raises the floor not the ceiling of what is possible.
> ... this seems like a technology that raises the floor not the ceiling of what is possible.
The root cause reason for this project existing is to show that GPU programming is not synonymous with CUDA (or the other offloading languages).
It's nominally to help people run existing code on GPUs. Disregarding that use case, it shows that GPUs can actually do things like fprintf or open sockets. This is obvious to the implementation but seems largely missed by application developers. Lots of people think GPUs can only do floating point math.
Especially on an APU, where the GPU units and the CPU cores can hammer on the same memory, it is a travesty to persist with the "offloading to accelerator" model. Raw C++ isn't an especially sensible language to program GPUs in but it's workable and I think it's better than CUDA.
>Disregarding that use case, it shows that GPUs can actually do things like fprintf or open sockets.
Can you elaborate on this? My mental model of GPU is basically like a huge vector coprocessor. How would things like printf or sockets work directly from the GPU when they require syscalls to trap into the OS kernel? Given that the kernel code is running on the CPU, that seems to imply that there needs to be a handover at some point. Or conversely even if there was unified memory and the GPU could directly address memory-mapped peripherals, you'd basically need to reimplement drivers wouldn't you?
It's mostly terminology and conventions. On the standard system setup, the linux kernel running in a special processor mode does these things. Linux userspace asks the kernel to do stuff using syscall and memory which both kernel and userspace can access. E.g. the io_uring register followed by writing packets into the memory.
What the GPU has is read/write access to memory that the CPU can also access. And network peripherals etc. You can do things like alternately compare-and-swap on the same page from x64 threads and amdgpu kernels and it works, possibly not quickly on some systems. That's also all that the x64 CPU threads have though, modulo the magic syscall instruction to ask the kernel to do stuff.
People sometimes get quite cross at my claim that the GPU can do fprintf. Cos actually all it can do is write numbers into shared memory or raise interrupts such that the effect of fprintf is observed. But that's also all the userspace x64 threads do, and this is all libc anyway, so I don't see what people are so cross about. You're writing C, you call `fprintf(stderr, "Got to L42\n");` or whatever, and you see the message on the console.
If fprintf compiles into a load of varargs mangling with a fwrite underneath, and the varargs stuff runs on the GPU silicon and the fwrite goes through a staging buffer before some kernel thread deals with it, that seems fine.
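As a toy illustration of that handover, here is a minimal sketch of the staging-buffer idea, with both sides modeled as ordinary CPU threads (the real thing runs the producer on the GPU and relies on the vendor's fine-grained shared memory; the names and mailbox layout here are made up):

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <thread>

// Toy model of the mechanism described above: the "device" side fills a
// mailbox in memory both sides can see and raises a flag; a host thread
// polls the flag and performs the real fwrite on its behalf. Both sides
// here are CPU threads, so this only illustrates the protocol, not a real
// GPU build.
struct Mailbox {
  std::atomic<uint32_t> ready{0};   // 0 = empty, 1 = request pending
  uint64_t length = 0;              // payload size in bytes
  char payload[256];                // staged fwrite data
};

void device_side(Mailbox *mb, const char *msg) {
  // What the GPU-side libc would do under fprintf: stage bytes, signal host.
  mb->length = std::strlen(msg);
  std::memcpy(mb->payload, msg, mb->length);
  mb->ready.store(1, std::memory_order_release);
  // Block until the host acknowledges completion.
  while (mb->ready.load(std::memory_order_acquire) != 0) { /* spin */ }
}

void host_side(Mailbox *mb) {
  // Host server loop: poll, service, acknowledge.
  while (mb->ready.load(std::memory_order_acquire) != 1) { /* spin */ }
  std::fwrite(mb->payload, 1, mb->length, stderr);   // the actual I/O
  mb->ready.store(0, std::memory_order_release);
}

int main() {
  Mailbox mb;
  std::thread host(host_side, &mb);
  device_side(&mb, "Got to L42\n");
  host.join();
}
```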
I'm pretty sure you could write to an nvme drive directly from the gpu, no talking to the host kernel at all, at which point you've arguably implemented (part of?) a driver for it. You can definitely write to network cards from them, without using any of this machinery.
We don't actually allow a GPU to directly fprintf, because GPU can't syscall. Only userspace can do that. You can have userspace keep polling and then do it on behalf of the GPU, but that's not the GPU doing it.
The GPU could do the equivalent of fprintf, if the concerned peripherals used only memory-mapped I/O and the IOMMU were configured to allow the GPU to access those peripherals directly, without any involvement from the OS kernel that runs on the CPU.
This is the same as on the CPU, where the kernel can allow a user process to access directly a peripheral, without using system calls, by mapping that peripheral in the memory space of the user process.
In both cases the peripheral must be assigned exclusively to the GPU or the user process. What is lost by not using system calls is the ability to share the peripheral between multiple processes, but the performance for the exclusive user of the peripheral can be considerably increased. Of course, the complexity of the user process or GPU code is also increased, because it must include the equivalent of the kernel device driver for that peripheral.
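For the CPU-side version of that idea, a minimal sketch using Linux's UIO framework looks something like the following; the device node, region size, and register offsets are placeholders, not a real device:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// Sketch of mapping a peripheral straight into a user process, bypassing
// per-access syscalls. Assumes a UIO driver has been bound to the device
// and exposes its first MMIO region as mapping 0 of /dev/uio0.
int main() {
  int fd = open("/dev/uio0", O_RDWR | O_SYNC);
  if (fd < 0) { perror("open"); return 1; }

  const size_t kRegionSize = 0x1000;  // assumed size of mapping 0
  void *base = mmap(nullptr, kRegionSize, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
  if (base == MAP_FAILED) { perror("mmap"); return 1; }

  // Registers are now plain loads/stores -- no kernel involvement per access.
  volatile uint32_t *regs = static_cast<volatile uint32_t *>(base);
  uint32_t status = regs[0];        // hypothetical status register
  regs[1] = 0x1;                    // hypothetical "go" bit
  std::printf("status = 0x%08x\n", status);

  munmap(base, kRegionSize);
  close(fd);
  return 0;
}
```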
At some point I was looking into using io_uring for something like this. The uring interface just works off of `mmap()` memory, which can be registered with the GPU's MMU. There's a submission polling setting, which means that the GPU can simply write to the pointer and the kernel will eventually pick up the write syscall associated with it. That would allow you to use `snprintf` locally into a buffer and then block on its completion. The issue is that the kernel thread goes to sleep after some time, so you'd still need a syscall from the GPU to wake it up. AMD GPUs actually support software level interrupts which could be routed to a syscall, but I didn't venture too deep down that rabbit hole.
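A host-only sketch of that setup with liburing and SQPOLL is below; the step of registering the ring memory with the GPU's MMU is the speculative part and isn't shown:

```cpp
#include <liburing.h>
#include <cstdio>
#include <unistd.h>

// With IORING_SETUP_SQPOLL a kernel thread polls the submission ring, so
// submitting a write is just filling an SQE in mmap'd memory and bumping
// the ring tail -- the piece one could imagine a GPU doing if that memory
// were visible to it. Needs a reasonably recent kernel and, on older ones,
// elevated privileges.
int main() {
  io_uring ring;
  io_uring_params params{};
  params.flags = IORING_SETUP_SQPOLL;
  params.sq_thread_idle = 2000;  // ms before the kernel poller goes to sleep

  if (io_uring_queue_init_params(8, &ring, &params) < 0) {
    perror("io_uring_queue_init_params");
    return 1;
  }

  const char msg[] = "hello from the submission ring\n";
  io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  io_uring_prep_write(sqe, STDOUT_FILENO, msg, sizeof(msg) - 1, 0);
  io_uring_submit(&ring);  // with SQPOLL this mostly just updates the ring tail

  io_uring_cqe *cqe;
  io_uring_wait_cqe(&ring, &cqe);   // block until the write has completed
  io_uring_cqe_seen(&ring, cqe);

  io_uring_queue_exit(&ring);
  return 0;
}
```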
> Lots of people think GPUs can only do floating point math.
IIRC, every Raspberry Pi is brought up by the GPU setting up the system before the CPU is brought out of reset and the bootloader looks for the OS.
> it is a travesty to persist with the "offloading to accelerator" model.
Operating systems would need to support heterogeneous processors running programs with different ISAs accessing the same pools of memory. I'd LOVE to see that. It'd be extremely convenient to have first-class processes running on the GPU MIMD cores.
I'm not sure there is much research done in that space. I believe IBM mainframe OSs have something like that because programmers are exposed to the various hardware assists that run as coprocessors sharing the main memory with the OS and applications.
Interesting - it resembles a network of heterogeneous systems that can share a memory space used primarily for explicit data exchange. Not quite what I was imagining, but probably much simpler to implement than a Unix where the kernel can see processes running on different ISAs on a shared memory space.
I guess hardware availability is an issue, as there aren't many computers with, say, an ARM, a RISC-V, an x86, and an AMD iGPU sharing a common memory pool.
OTOH, there are many where a 32-bit ARM shares the memory pool with 64-bit cores. Usually the big cores run applications while the small ARM does housekeeping or other low-latency task.
> Not quite what I was imagining, but probably much simpler to implement than a Unix where the kernel can see processes running on different ISAs on a shared memory space.
Indeed. The other argument is that treating the computer as a distributed system can make it scale better to say hundreds of cores compared to a lock-based SMP system.
Up to GPGPUs, there was no reason to build a machine with multiple CPUs of different architectures except running different OSs on them (such as the Macs, Suns and Unisys mainframes with x86 boards for running Windows side-by-side with a more civilized OS). With GPGPUs you have machines with a set of processors that are good on many things, but not great at SIMD and one that's awesome at SIMD, but sucks for most other things.
And, as I mentioned before, there are lots of ARM machines with 64-bit and ultra-low-power 32-bit cores sharing the same memory map. Also, even x86 variants with different ISA extensions can be treated as different architectures by the OS - Intel had to limit the fast cores of its early asymmetric parts because the low-power cores couldn't do AVX512 and OSs would not support migrating a process to the right core on an invalid instruction fault.
If the OS supports it, you can make programs that start threads on CPUs and GPUs and let those communicate. You run the SIMD-ish functions on the GPUs and the non-SIMD-heavy functions on the CPU cores.
I have a strong suspicion GPUs aren't as bad at general-purpose stuff as we perceive and we underutilize them because it's inconvenient to shuttle data over an architectural wall that's not really there in iGPUs.
Maybe it doesn't make sense, but it'd be worth looking into just to know where the borders of the problem lie.
Nah, they're pretty bad. They don't speculate or prefetch nearly as well as CPUs, and most code kind of relies on that to be fast. If you are programming for a GPU and you want to go fast you generally have to work quite hard for it.
> The root cause reason for this project existing is to show that GPU
> programming is not synonymous with CUDA (or the other offloading
> languages).
1. The ability to use a particular library does not reflect much on which languages can be used.
2. Once you have PTX as a backend target for a compiler, obviously you can use all sorts of languages on the frontend - which NVIDIA's drivers and libraries won't even know about. Or you can just use PTX as your language - making your point that GPU programming is not synonymous with CUDA C++.
> It's nominally to help people run existing code on GPUs.
I'm worried you might be right. But - we should really not encourage people to run existing CPU-side code on GPUs, that's rarely (or maybe never?) a good idea.
> Raw C++ isn't an especially sensible language to program GPUs in
> but it's workable and I think it's better than CUDA.
CUDA is an execution ecosystem. The programming language for writing kernel code is "CUDA C++", which _is_ C++, plus a few builtin functions ... or maybe I'm misunderstanding this sentence.
GPU offloading languages - cuda, openmp etc - work something like:
1. Split the single source into host parts and gpu parts
2. Optionally mark up some parts as "kernels", i.e. have entry points
3. Compile them separately, maybe for many architectures
4. Emit a bunch of metadata for how they're related
5. Embed the GPU code in marked up sections of the host executable
6. Embed some startup code to find GPUs into the x64 parts
7. At runtime, go crawling around the elf section launching kernels
This particular library (which happens to be libc) is written in C++, compiled with ffreestanding target=amdgpu, to LLVM bitcode. If you build a test, it compiles to an amdgpu elf file - no x64 code in it, no special metadata, no elf-in-elf structure. The entry point is called _start. There's a small "loader" program which initialises hsa (or cuda) and passes it the address of _start.
I'm not convinced by the clever convenience cut-up-and-paste-together style embraced by cuda or openmp. This approach brings the lack of magic to the forefront. It also means we can add it to openmp etc when the reviews go through so users of that suddenly find fopen works.
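To make that concrete, a test in this model is just ordinary C++ against libc -- no kernel annotations, no host/device split in the source. Something like the sketch below; the freestanding build flags and the loader invocation are toolchain details covered in the LLVM libc GPU docs and omitted here:

```cpp
#include <cstdio>

// Illustrative "GPU executable" test: plain C++ against libc. The build
// (freestanding, amdgpu/nvptx target, linked against the GPU libc) and the
// launch (the small loader that initialises HSA/CUDA and jumps to _start)
// happen outside this file; none of that machinery appears in the source.
int main() {
  FILE *f = std::fopen("/tmp/gpu_libc_demo.txt", "w");  // serviced via RPC to the host
  if (!f) return 1;
  std::fprintf(f, "hello from code running on the GPU\n");
  std::fclose(f);
  std::printf("wrote a file without any explicit offloading code\n");
  return 0;
}
```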
CUDA C++ _can_ work like that. But I would say that these are mostly kiddie wheels for convenience. And because, in GPU programming, performance is king, most (?) kernel developers are likely to eventually need to drop those wheels. And then:
* No single source (although some headers might be shared)
* Kernels are compiled and linked at runtime, for the platform you're on, but also, in the general case, with extra definitions not known apriori (and which are different for different inputs / over the course of running your program), and which have massive effect on the code.
* You may or may not use some kind of compiled-kernel caching mechanism, but you certainly don't have all possible combinations of targets and definitions available, since that would be millions of compiled kernels.
It should also be mentioned that OpenCL never included the kiddie wheels to begin with; although I have to admit it makes it less convenient to start working with.
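A rough sketch of that runtime-compilation workflow using NVRTC and the CUDA driver API; the kernel, the SCALE_FACTOR definition, and the architecture flag are illustrative, not anyone's production setup:

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <string>

// Compile a kernel at runtime with a definition only known once the input
// is seen, then load and launch it through the driver API. Error handling
// is omitted to keep the shape visible.
static const char *kKernelSrc = R"(
extern "C" __global__ void scale(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= SCALE_FACTOR;   // SCALE_FACTOR injected at compile time, at runtime
})";

int main() {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kKernelSrc, "scale.cu", 0, nullptr, nullptr);
  const char *opts[] = {"-DSCALE_FACTOR=2.5f", "--gpu-architecture=compute_70"};
  nvrtcCompileProgram(prog, 2, opts);

  size_t ptxSize = 0;
  nvrtcGetPTXSize(prog, &ptxSize);
  std::string ptx(ptxSize, '\0');
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  // Load the freshly generated PTX and launch it.
  cuInit(0);
  CUdevice dev;  cuDeviceGet(&dev, 0);
  CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
  CUmodule mod;  cuModuleLoadData(&mod, ptx.c_str());
  CUfunction fn; cuModuleGetFunction(&fn, mod, "scale");

  int n = 1024;
  CUdeviceptr d_x;
  cuMemAlloc(&d_x, n * sizeof(float));   // contents left uninitialised in this sketch
  void *args[] = {&d_x, &n};
  cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
  cuCtxSynchronize();

  cuMemFree(d_x);
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
  return 0;
}
```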
that's clearly not a bad thing. however encouraging people to run mutating, procedural code with explicit loops and aliasing maybe isn't the right path to get there. particularly if you just drag forward all the weird old baggage with libc and its horrible string conventions.
I think any programming environment that treats a gpu as a really slow serial cpu isn't really what you want(?)
What if it encourages people to write parallel and functional code on CPUs? That'd be a good thing. Influence works both ways.
The bigger problem is that GPUs have various platform features (shared memory, explicit cache residency and invalidation management) that CPUs sadly don't yet. Sure, you could expose these facilities via compiler intrinsics, but then you end up with code that might be syntactically valid C but is alien both to CPUs and human minds.
On the contrary I would love that. The best case scenario in my mind is being able to express the native paradigms of all relevant platforms while writing a single piece of code that can then be compiled for any number of backends and dynamically retargeted between them at runtime. It would make debugging and just about everything else SO MUCH EASIER.
The equivalent of being able to compile some subset of functions for both ARM and x86 and then being able to dynamically dispatch to either version at runtime, except replace ARM with a list of all the GPU ISAs that you care about.
One thing this gives you is syscall on the gpu. Functions like sprintf are just blobs of userspace code, but others like fopen require support from the operating system (or whatever else the hardware needs you to do). That plumbing was decently annoying to write for the gpu.
These aren't gpu kernels. They're functions to call from kernels.
i wish people in our industry would stop (forever, completely, absolutely) using metaphors/allusions. it's a complete disservice to anyone that isn't in on the trick. it doesn't give you syscalls. that's impossible because there's no sys/os on a gpu and your actual os does not (necessarily) have any way to peer into the address space/scheduler/etc of a gpu core.
what it gives you is something that's working really really hard to pretend to be a syscall:
> Traditionally, the C library abstracts over several functions that interface with the platform’s operating system through system calls. The GPU however does not provide an operating system that can handle target dependent operations. Instead, we implemented remote procedure calls to interface with the host’s operating system while executing on a GPU.
Well, I called it syscall because it's a void function of 8 u64 arguments which your code stumbles into, gets suspended, then restored with new values for those integers. That it's a function instead of an instruction doesn't change the semantics. My favourite of the uses of that is to pass six of those integers to the x64 syscall operation.
This isn't misnaming. It's a branch into a trampoline that messes about with shared memory to give the effect of the x64 syscall you wanted, or some other thing that you'd rather do on the cpu.
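A rough sketch of what the host side of that trampoline could look like -- the 8-slot layout (slot 0 = syscall number, 1-6 = arguments, 7 = return value) is an illustrative assumption, not the actual LLVM libc wire format:

```cpp
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// The GPU parks 8 u64s in memory the CPU can see; the CPU forwards six of
// them to the real x64 syscall instruction and writes the result back,
// which the GPU thread picks up when it resumes.
void service_request(uint64_t *slots) {
  long ret = syscall(static_cast<long>(slots[0]),
                     slots[1], slots[2], slots[3],
                     slots[4], slots[5], slots[6]);
  slots[7] = static_cast<uint64_t>(ret);   // handed back to the waiting GPU thread
}

int main() {
  // Fake a request the GPU might have staged: write(2) of a short string.
  const char msg[] = "forwarded to the x64 syscall\n";
  uint64_t slots[8] = {SYS_write, STDOUT_FILENO,
                       reinterpret_cast<uint64_t>(msg), sizeof(msg) - 1,
                       0, 0, 0, 0};
  service_request(slots);
  std::printf("write returned %ld\n", static_cast<long>(slots[7]));
  return 0;
}
```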
There's a gpu thing called trap which is closer in behaviour to what you're thinking of but it's really annoying to work with.
Side note, RPC has a terrible rep for introducing failure modes into APIs, but that's completely missing here because pcie either works or your machine is gonna have to reboot. There are no errors on the interface that can be handled by the application.
> Well, I called it syscall because it's a void function of 8 u64 arguments which your code stumbles into, gets suspended, then restored with new values for those integers
I'll put it really simply: is there a difference (in perf, semantics, whatever) between using these "syscalls" to implement fopen on GPU and using a syscall to implement fopen on CPU? Note that's a rhetorical question because we both already know that the answer is yes. So again you're just playing sleight of hand in calling them syscalls, and I'll emphasize: this is a sleight of hand that the dev himself doesn't play (so why would I take your word over his).
Wonderfully you don't need to trust my words, you've got my code :)
If semantics are different, that's a bug/todo. It'll have worse latency than a CPU thread making the same kernel request. Throughput shouldn't be way off. The GPU writes some integers to memory that the CPU will need to read, and then write other integers, and then load those again. Plus whatever the x64 syscall itself does. That's a bunch of cache line invalidation and reads. It's not as fast as if the hardware guys were on board with the strategy but I'm optimistic it can be useful today and thus help justify changing the hardware/driver stack.
The whole point of libc is to paper over the syscall interface. If you start from musl, "syscall" can be a table of function pointers or asm. Glibc is more obstructive. This libc open codes a bunch of things, with a rpc.h file dealing with synchronising memcpy of arguments to/from threads running on the CPU which get to call into the Linux kernel directly. It's mainly carefully placed atomic operations to keep the data accesses well defined.
There's also nothing in here which random GPU devs can't build themselves. The header files are (now) self contained if people would like to use the same mechanism for other functionality and don't want to handroll the data structure. The most subtle part is getting this to work correctly under arbitrary warp divergence on volta. It should be an out of the box thing under openmp early next year too.
The RPC implementation in LLVM is an adaptation of Jon's original state machine (see https://github.com/JonChesterfield/hostrpc). It looks very different at this point, but we collaborated on the initial design before I fleshed out everything else. Syscall or not is a bit of a semantic argument, but I lean more towards syscall 'inspired'.
The syscall layer this runs on was written at https://github.com/JonChesterfield/hostrpc, 800 commits from May 2020 until Jan 2023. I deliberately wrote that in the open, false paths and mistakes and all. Took ages for a variety of reasons, not least that this was my side project.
You'll find the upstream of that scattered across the commits to libc, mostly authored by Joseph (log shows 300 for him, of which I reviewed 40, and 25 for me). You won't find the phone calls and offline design discussions. You can find the tricky volta solution at https://reviews.llvm.org/D159276 and the initial patch to llvm at https://reviews.llvm.org/D145913.
GPU libc is definitely Joseph's baby, not mine, and this wouldn't be in trunk if he hadn't stubbornly fought through the headwinds to get it there. I'm excited to see it generating some discussion on here.
But yeah, I'd say the syscall implementation we're discussing here has my name adequately written on it to describe it as "my code".
Why does a perf difference factor into it? There is no requirement for a syscall to be this fast or else it isn't a syscall. If you have a hot loop you shouldn't be putting a syscall in it, not even on the CPU.
It's a matter of perspective. If you think of the GPU as a separate computer, you're right. If you think of it as a coprocessor, then the use of RPC is just an implementation detail of the system call mechanism, not a semantically different thing.
When an old school 486SX delegates a floating point instruction to a physically separate 487DX coprocessor, is it executing an instruction or doing an RPC? If RPC, does the same instruction start being a real instruction when you replace your 486SX with a 486DX, with an integrated GPU? The program can't tell the difference!
A 486SX never delegates floating point instructions, the 487 is a full 486DX that disables the SX and fully takes over, you are thinking of 386 and older.
> It's a matter of perspective. If you think of the GPU as a separate computer, you're right.
this perspective is a function of exactly one thing: do you care about the performance of your program? if not then sure indulge in whatever abstract perspective you want ("it's magic, i just press buttons and the lights blink"). but if you don't care about perf then why are you using a GPU at all...? so for people that aren't just randomly running code on a GPU (for shits and giggles), the distinction is very significant between "syscall" and syscall.
people who say these things don't program GPUs for a living. there are no abstractions unless you don't care about your program's performance (in which case why are you using a GPU at all).
The "proper syscall" isn't a fast thing either. The context switch blows out your caches. Part of why I like the name syscall is it's an indication to not put it on the fast path.
The implementation behind this puts a lot of emphasis on performance, though the protocol was heavily simplified in upstreaming. Running on pcie instead of the APU systems makes things rather laggy too. Design is roughly a mashup of io_uring and occam, made much more annoying by the GPU scheduler constraints.
The two authors of this thing probably count as people who program GPUs for a living for what it's worth.
Not everything in every program is performance critical. A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs. That's as much BS on GPU as it is on CPU. In CPU land, we moved past this sophomoric attitude decades ago. The GPU world might catch up one day.
Are you planning on putting fopen() in an inner loop or something? LOL
The whole reason CUDA/GPUs are fast is that they explicitly don’t match the architecture of CPUs. The truly sophomoric attitude is that all compute devices should work like CPUs. The point of CUDA/GPUs is to provide a different set of abstractions than CPUs that enable much higher performance for certain problems. Forcing your GPU to execute CPU-like code is a bad abstraction.
Your comment about putting fopen in an inner loop really betrays that. Every thread in your GPU kernel is going to have to wait for your libc call. You’re really confused if you’re talking about hot loops in a GPU kernel.
> A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs.
You're talking to the wrong people; this is definitely not true in general.
genuinely asking: where else should ML engineers focus their time, if not on looking at datapath bottlenecks in either kernel execution or the networking stack?
The point is that you should focus on the bottlenecks, not on making every random piece of code "as fast as possible". And that sometimes other things (maintainability, comprehensibility, debuggability) are more important than maximum possible performance, even on the GPU.
That's fair, but I didn't understand OP to be claiming above that "cudaheads" aren't looking at their performance bottlenecks before driving work, just that they're looking at the problem incorrectly (and eg: maybe should prioritize redesigns over squeezing perf out of flawed approaches.)
> A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs
I don't know what a "cudahead" is but if you're gonna build up a strawman just to chop it down have at it. Doesn't change anything about my point - these aren't syscalls because there's no sys. I mean the dev here literally spells it out correctly so I don't understand why there's any debate.
ffmpeg has libraries. Why are you forking a separate process to call a single function? This is a very common problem I see in high level languages with poor C interoperability. Although cgo works fine, so what gives? Less code for more problems.
cgo doesn't just work fine, it's still treated very much as a second-class citizen by the Go tooling. It doesn't integrate with the standard build tooling properly and requires additional manual steps. You also can't really have an upstream dep that uses cgo internally, the burden falls onto the end binary to manually set the flags correctly.
And queries itself to get the schema: https://github.com/sqlite/sqlite/blob/802b042f6ef89285bc0e72...