I am totally out of the loop here, but wasn't the main problem with Nouveau the lack of documentation from Nvidia? They had to reverse engineer things to get it working.
I get that some of the architectural choices no longer make sense, and starting from scratch will address those. But is the goal to have performance that is somewhat comparable to the proprietary drivers? Or just good enough to run the desktop environment with hardware acceleration?
The problem with Nouveau was that Nvidia hardware could not be reclocked (the cores were basically always in low power mode) without a signed firmware blob. That blob couldn't be legally distributed except by Nvidia, so the open source folks more or less gave up on a high performance driver.
Now, for newer hardware, Nvidia has changed some aspects of the firmware and allows redistribution. So it's feasible to make a good open source driver.
AMD and Intel also use different drivers for different hardware generations, since eventually things change so much that it's better to start clean.
With regards to reverse engineering, Mesa has a number of reverse engineered drivers. That isn't anything new.
the problem was that nouveau used to do this back when the firmware was easy to intercept from the driver (and could be extracted easily). the newer drivers use different methods of firmware upload and it's a real chore to do so now.
so nouveau gave up on it. they also expected nvidia to drop some firmware.
now that newer cards have GSP.bin firmware, which can be interfaced with easily, things are different. i would hazard a guess that it's similar to atombios from amd: you just call a function in GSP and it knows what registers to poke with the right values to achieve what you need.
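To make that idea concrete, here's a minimal sketch of the difference between a driver that pokes registers directly and one that asks the firmware to do it. Everything here is invented for illustration; none of these names, offsets, or calls reflect NVIDIA's actual GSP interface:

```rust
// Illustrative only: all names and register offsets are made up.

/// Old-school approach: the driver itself knows which registers to poke.
/// This is exactly the knowledge Nouveau had to reverse engineer.
fn set_core_clock_direct(mmio: &mut [u32], mhz: u32) {
    mmio[0x10] = 0x8000_0000 | mhz; // hypothetical clock register
    mmio[0x11] = 0x1;               // hypothetical "latch new value" register
}

/// Firmware-mediated approach: the driver sends a request and the firmware
/// decides which registers to poke, conceptually like an AtomBIOS table call.
fn set_core_clock_via_firmware(mhz: u32) -> Result<(), &'static str> {
    firmware_call("SET_CORE_CLOCK", mhz)
}

/// Stand-in for the message channel to the GSP; in reality this would be
/// an RPC over shared memory, not a plain function call.
fn firmware_call(cmd: &str, arg: u32) -> Result<(), &'static str> {
    println!("firmware handles {cmd}({arg})");
    Ok(())
}

fn main() {
    let mut fake_mmio = vec![0u32; 64];
    set_core_clock_direct(&mut fake_mmio, 1500);
    set_core_clock_via_firmware(1500).unwrap();
}
```

The point is that in the second model most of the hardware-specific knowledge lives in the firmware, so an open driver mainly needs to know the calling convention.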
My understanding is that nowadays most of the heavy lifting is done by magic going on in the firmware, so the actual driver is relatively simple and is open source: https://github.com/NVIDIA/open-gpu-kernel-modules
Any efforts by Nvidia are welcome, but that raises the question of why Red Hat are writing a new driver. Presumably there is some aspect of the OSS component that is dissatisfying Red Hat.
It is weird for a 3rd party to be maintaining a 2nd driver when the first party has a reasonable OSS driver.
There are a few things someone would pay Red Hat dearly to do. In my opinion the most likely are:
- No telemetry.
- Enabling software blocked features.
- Emulation.
There are a few possibilities when it comes to making a performant OSS driver:
- It's impracticable, because the software you want to run on the GPU is so complex, and the hardware is so complex, that you will never get enough insight to make something that compares with what the people who can see it all can build.
- It's eminently practicable, because either (1) the software and the GPU hardware are much simpler than they look (perhaps there is a lot of obfuscation of the simplicity), or (2) the application that best utilizes the hardware only needs a limited feature set that is within scope.
I'm leaning toward "application limited scope" and "telemetry." That aligns best with what is actually happening, which is that NVIDIA is scooping up a lot of valuable intelligence on LLM workloads, and that there isn't enough competition for LLM "ASICs" to make them cheap enough to be worthwhile.
I never understood all the hate we in the OSS world have for telemetry. Why? I mean, if Nvidia and Red Hat paid for this to be developed, wouldn't it help them continue to develop it by knowing how, where, and how often it was used?
Apple, which collects a lot of telemetry, made privacy a tenet of their brand, and they've successfully conflated the two in the mind of a layperson.
Separately, most telemetry would tell stories like, "This project is a failure." Little incentive for people to adopt it. Then again, most OSS is ordered by the mania of its programmer-creators, not product or engineering quality informed by telemetry. Maybe in a Darwinian way, we only have the OSS that can thrive without telemetry and reactive product and engineering decisions.
Debian, RedHat and Firefox do have telemetry. Not a lot, but enough to prove it is possible to do it in a way that doesn't piss most people off.
Naturally the way they collect it is open source, it's largely de-identified, and because it's open source you can verify it's de-identified enough for you. And if it isn't, you can turn it off.
So it's possible to do it well, where "well" means it gets you the data you need without pissing off the users. Most proprietary vendors don't bother to do it well, for whatever reason.
But they should be careful: it's a big factor driving people toward things like Home Assistant, the Linux desktop, and now this, apparently.
Because it's never just for engineering purposes anymore. Marketing departments are CRAZY about consumer data of any kind and you can be assured the value of such collected data will be maximized on the information exchange market. Advertisers pay good money for trustable data streams, which can be essentially de-anonymized with enough cross correlation between sources. A GPU happens to be a good place to capture user activity, since from the screen buffer you can tell what website they visit, what videos they watch and what games they play.
This stuff is kept rather hush-hush so as not to scare people, but it's also well documented and certainly not an unfounded conspiracy. It's also why Europeans have adopted far-reaching privacy laws (GDPR): their industries don't rely on consumer surveillance the way American industries have come to over the last few decades.
The telemetry aspect is easily resolved by any enterprise that cares enough about it.
The conclusion that this is more to do with the overall architecture and deployment of the existing driver is much more plausible.
> The telemetry aspect is easily resolved by any enterprise that cares enough about it.
Yeah, this is how, by writing your own driver. NVIDIA sells you turnkey DGX machines. It doesn't give you firmware. You have to be Internet-connected to refresh your various licenses, at some point, which is the moment the telemetry is shared. Google "NVIDIA telemetry."
If you are using NVIDIA on the cloud, well all bets are off. You are using their drivers. Amazon can't force you, in your VM, to install a different driver for the GPU you are using - there's no alternative to the proprietary one. Hence, my theory for why Red Hat could be paid to do this.
Telemetry is something that NVIDIA doesn't budge on for enterprises. You're welcome to see for yourself and start a sales call. I hear they're pretty busy.
> but that raises the question of why Red Hat are writing a new driver
Nvidia's driver cannot be included in the upstream Linux kernel: it doesn't follow kernel coding style and code organization, but more importantly it is tightly coupled to a single version of their GSP firmware - the two have to be updated at the same time.
Shader compilation happens on CPU, and is neither simple nor (for the nvidia compiler) open source.
GPU drivers are significantly more complicated than any other drivers because they do things that are not considered part of a driver for any other device; the proprietary nV driver contains roughly as much code as the Linux kernel.
I'm not sure if it's different for Nvidia, but the Gamers Nexus video about Intel drivers suggests a lot of the work of getting good performance is done on the driver side for all GPU vendors: https://www.youtube.com/watch?v=Qp3BGu3vixk
Well, concerning memory safety, that might be a more important feature than you think. Allow me to quote a comment I just got an email notification for on a GNOME thread about NVIDIA:
> I wish I could be more optimistic about mutter!3304 coming soon, though... it's been 5 months already, and it doesn't seem anywhere closer to being merged into main. To make matters worse, patched mutter 45.5 seems to be causing a use-after-free in NVIDIA's kernel driver.
That argument is questionable. C is trivial. The PC architecture is not, and the Linux architecture absolutely is not. What makes kernel programming hard is that there is a lot of knowledge that must be internalized: exactly how DMA works for certain hardware, or why this specific bit needs to be set under that lock. There's often no simple way of debugging or single-stepping these things without simulating the hardware.
The kernel could have been written in assembly with macros, it wouldn't make its development any harder. The best Rust can do is not make kernel more difficult than it already is.
Linus himself made that point a few months ago (https://youtu.be/YyRVOGxRKLg). If you take a look at how devoid large, important, old OSS projects are of young talent in their 20s, 30s and sometimes even 40s, it's looking incredibly bleak.
Rust is technically an improvement over memory-unsafe languages, but it has also created enthusiasm among largely young and proficient coders. The ecosystem has a lot of dynamism. Drawing those people in is one of the most important things, if not the single most important thing, for the health of these projects going forward, given how many leaders are close to retirement.
Not sure I agree with that. C is a very easy language to learn. The problem with it is that you have to be careful with memory management. That does take effort to learn, and I still make mistakes after writing C for over 20 years.
I love Rust, but it is a very complex language, with a complex, full-featured stdlib, and rich, sometimes-inscrutable type system. It is more difficult to learn than C, but also more difficult (and sometimes impossible) to make many of the mistakes that you can make in C.
Rust isn't that hard to learn, and many of its users come from managed languages and never dared to learn C because "it's too hard".
C looks easy on the surface, but the syntax is pretty dated and full of footguns (yes, even just the syntax, not even talking about UB), and learning the language is a pretty intimidating experience because every time you think you know something, you actually don't and get bitten later.
Rust on the other hand is a good language for CS students: you have a lot of things to assimilate upfront, but by the time you've reached the level required to pass the class, you're actually ready to use it in production, and the resulting code produced by a sophomore will be more stable than C code written by wizards.
I know of two old syntactic footguns (assignment vs equality comparison; precedence of bitwise operators) and a single new one no one cares about (sizeof(int)+1 vs sizeof(int){+1}). The rest of the common bugbears look to be caused solely by bad pedagogy (the declaration syntax is not TYPE NAME; the switch statement is a computed goto not a multiarm conditional; there is no separate struct declaration). What am I missing?
Yep. I like to compare C to Brainfuck. Brainfuck is an even smaller, simpler language than C. You can learn Brainfuck in 10 minutes or less. If a language being easy to learn implies that it's easy to use productively, then Brainfuck ought to be one of the most productive languages on the planet! Of course in practice it's so small & so simple & provides so little to the programmer that it's nearly impossibly difficult to use for non-trivial programs.
The ease of writing software in a given programming language is not a linear function of the complexity of the programming language.
Very simple languages are very difficult to use. Very complex languages are very difficult to use. Languages of intermediate complexity tend to be much easier to use than those at the extremes. C is more towards the "simple" extreme than the ideal, Rust is (IMO) a bit more towards the "complex" extreme than the ideal, but is closer to the ideal than C.
I do somewhat doubt that. `no_std` Rust is still quite different from normal, userspace Rust. You don't get all the fancy libraries or whatever (of course Linus would probably just veto them anyway) and the Linux kernel development model is probably quite different (IIRC no cargo and whatnot).
So… what should we reckon? Would there be a difference in getting new developers into `no_std` Rust in the kernel, and how different would that be versus having people learn freestanding C, with all the kernel add-ons and knickknacks?
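To make the gap concrete, here's a minimal sketch of what `no_std` library code looks like; purely illustrative, not actual kernel code:

```rust
// A tiny no_std library: no heap, no std collections, only what `core` provides.
#![no_std]

/// Compute a simple checksum over a byte slice using only `core`.
/// In userspace Rust you might pull in third-party crates and `std` helpers;
/// here you work with what `core` gives you.
pub fn checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
}

// No `println!`, no `std::fs`, and no allocation unless you bring in the
// `alloc` crate with an allocator; panics are handled by whoever links the
// final binary (in-tree, the kernel provides its own panic handling).
```

So the language is the same, but the day-to-day experience is closer to freestanding C than to writing a cargo-managed userspace application.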
Systems programming is always a full step sideways from traditional applications and I do see where you are coming from.
I would still reckon that familiarity with standard Rust (even with std) will leave more programmers willing to make the leap than having to learn C for this one project.
I know of fewer than a handful of C projects started in 2024; I know there are a bunch in Rust, even with no_std.
Yeah it’s a weird take. The hard part about learning rust is borrow semantics and understanding the kinds of architectures the compiler will and won’t allow.
Well maybe it's just me, but the borrow semantics were never all that bad. Of course there were things that needed to be relearnt, but it's not all that bad all things considered.
Coming from C, the borrow semantics are basically what I try to get anyway, violating them only with suspicion and great care. Rust just makes it easy to check that I didn't screw up what I was already trying to do.
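A small sketch of what that looks like in practice; the error in the comment is what rustc reports for this pattern:

```rust
// The borrow checker enforcing the discipline a careful C programmer
// already tries to follow by hand.
fn main() {
    let mut buf = vec![1u8, 2, 3];

    // A shared borrow, like holding a pointer into the buffer in C.
    let first = &buf[0];

    // In C you'd have to *remember* not to reallocate or free the buffer
    // while that pointer is live. In Rust the compiler remembers for you:
    // buf.push(4); // error[E0502]: cannot borrow `buf` as mutable
    //              // because it is also borrowed as immutable

    println!("first element: {first}");

    // Once the borrow ends, mutation is fine again.
    buf.push(4);
    println!("len is now {}", buf.len());
}
```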
20 years is not a very long time, and people are still learning c. My 17 year old intern learned c in college and prefers it over 'modern crap' (his words, but he wasn't talking about rust; he meant JavaScript). We do embedded (low-powered MCUs with a few KB of memory) and I too prefer c over rust for it; it's a lot easier to attract c than rust people, in my experience, for this work. Most have never touched rust and wouldn't see a reason for it. Many of them are young.
> My 17 year old intern learned c in college and prefers it over ‘modern crap’
I had a lot of opinions like this too when I was 17, but age and experience has disabused me of many of them.
The embedded story on Rust still has many rough edges, but it's improving every day, and I could easily see it replace C in many places, given enough time (I wouldn't be too surprised if we eventually see companies distributing a BSP that's written in Rust).
Frankly, at this point in time, I think it's foolish to start a new project in C unless you have a really good reason to do so. Many embedded systems certainly qualify as a really good reason, but I very much hope that reason diminishes over time.
> I had a lot of opinions like this too when I was 17, but age and experience has disabused me of many of them.
Sure, I didn't say I agree with them, I barely remember when I was 17, it was that long ago.
I'm saying that the young people I meet are more interested in c, so the 'in 20 years no one knows c' fear is not exactly true.
We try Rust now and then; it's not worth it yet, in my opinion, for what we do. The tooling and libs we have for c are vast and, like I said, c people are really easy to get; Rust people, not so much.
I hope this changes, but for now it's just too much of a struggle to warrant it. And I was only responding to the fear of not having capable c devs in 20 years. There will be plenty.
That's a very optimistic view on the future of Rust, and a very pessimistic view on the future of C. Chances are that C will be longer around than Rust just because of the Lindy Effect ;)
Kernel C isn't userspace C. The bar for kernel C is much higher and less forgiving than for userspace C.
I can't speak to 'telco grade C', because I have never experienced it. I have, however, seen telco-submitted code in the upstream kernel, and it's not above average quality.
Honestly? Given I've seen crashes and printk messages from AMDGPU with words like "General Protection Fault," I'd say memory safety is probably the most important thing missing in these GPU drivers.
A coworker of mine formerly worked on Nvidia's proprietary drivers. They remarked on just how wrong the Nouveau driver was about certain things, including the functionality and reasons for certain registers.
Around a decade ago, I was introduced to a couple people who it turned out worked at nvidia. After a bunch of conversation, nouveau came up; they were intensely negative about it, mocking it and its developers, criticizing its existence, even, and declaring it would never amount to anything. (Meanwhile, I was running it on a computer at home and it was significantly more stable than the nvidia proprietary driver at the time, and performed well enough for my use.)
I wish I'd had more conversational courage at the time to "stand up" for nouveau with them; it was super rude to shit on someone else's hard work, especially when the reason why they had to do that hard work was because nvidia was a bunch of jerks with a shitty FOSS policy.
That's completely fucked up and has soured my opinion of nvidia and the people involved with it more than anything they've ever officially done as a company.
NVIDIA started publishing more of their driver as open-source in 2022, which while not hardware documentation probably helps a lot.
You may be wondering why Red Hat is bothering with this effort, then. I assume it's so that the code can be added to Linux directly, as opposed to being out of tree.
They're still reverse engineering. The hardware is so good that people would rather use an Nvidia GPU with reverse engineered drivers than AMD with open documentation and official open source drivers. Same with Apple and Asahi.
I was hoping they officially partnered with NVIDIA in that effort, perhaps eventually even replacing the proprietary one on Linux. But then I awoke from that dream, it's less exciting. Still cool though.
Unfortunately Linux Desktop still seems to be too irrelevant for NVIDIA. All current drivers have issues with suspend, multiple displays and Wayland among other things.
It probably makes it even worse: it's the enterprise/data center cards for ML that are making huge profits and where the growth is, and those don't even have a video out.
The kernel driver manages things like display setup, memory management (isolation between processes), and funneling of commands from user space to the hardware. This is more a replacement for Nouveau than NVK.
NVK generates the commands to send to the hardware, by converting vulkan APIs to Nvidia-specific instructions, and feeds them to the kernel driver.
This new driver (along with the modules Nvidia have released as OSS [1]) are kernel drivers and do not break userspace. NVK is a Vulkan implementation for Nvidia GPUs as part of the Mesa project (the same project that AMD's userspace drivers are a part of, even though AMDs kernel drivers are in tree). They are both needed to make a cohesive driver.
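For a rough picture of that split, here's a purely illustrative sketch; every name is invented and bears no relation to the real NVK or kernel interfaces (which actually talk over DRM ioctls, not function calls):

```rust
// Userspace (the NVK role): translate API-level requests into
// hardware-specific commands.
struct CommandBuffer {
    words: Vec<u32>, // encoded GPU commands
}

fn record_draw(vertex_count: u32) -> CommandBuffer {
    CommandBuffer {
        words: vec![0xD3A0_0001, vertex_count], // hypothetical opcode + argument
    }
}

// Kernel (the Nouveau/Nova role): own the hardware, manage memory and
// isolation, sanity-check the submission, and queue it to the GPU.
fn kernel_submit(cmds: &CommandBuffer) -> Result<(), &'static str> {
    if cmds.words.is_empty() {
        return Err("empty submission");
    }
    // ... map buffers, set up the GPU context, ring the doorbell ...
    Ok(())
}

fn main() {
    let cmds = record_draw(3);
    kernel_submit(&cmds).expect("submission failed");
    println!("submitted {} command words", cmds.words.len());
}
```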
I don't get it - they want to sell hardware. Microsoft v Linux v MacOS aside, why don't they care at all? I can understand Linux might get a little ... fickle. But why weigh into this? Put all your energy into drivers for all, and sell more product.
>and without the Nouveau baggage that's built up over the years in supporting NVIDIA GPUs going back to its early days.
Not supporting most Nvidia hardware is not a good thing. I hope its adoption by commercial users (home users generally don't have the latest Nvidia GPU) doesn't mean Nouveau bitrots from lack of companies with deep pockets paying for dev work. As for the inherent memory safety, well, remember how safe memory-safe Java was?
Java the language was memory safe. The runtime and libraries weren't all memory safe. The JITs can break memory safety. Applets had integration issues. So what hurt Java were the opportunities to go around memory safety instead of attacking the memory-safe code itself.
Memory-safe systems languages (e.g. Ada, Rust) address this by keeping as much as possible memory safe, with either little or zero runtime. Rust makes things safe that weren't before. It has unsafe, which is used less than in some other languages. The type checker can also catch many logic-level errors. The safety situation is better with Rust code than with Java code.
That doesn’t mean it will solve NVIDIA users’ problems, though. They are usually worried about compatibility, performance, and reliability. Rust mostly helps in one area. We’ll see about the rest, especially compatibility, as you said.
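As a small illustration of how Rust keeps unsafe contained (a generic sketch, nothing to do with any particular driver): the unsafe part is a few lines inside a function that exposes a safe, checked interface, so callers can't misuse it.

```rust
/// Returns the first element of a slice without a bounds check.
/// Safe to call, because the emptiness check happens before the unsafe block.
fn first_or_zero(data: &[u8]) -> u8 {
    if data.is_empty() {
        return 0;
    }
    // SAFETY: we just verified the slice is non-empty, so index 0 is valid.
    unsafe { *data.get_unchecked(0) }
}

fn main() {
    assert_eq!(first_or_zero(&[7, 8, 9]), 7);
    assert_eq!(first_or_zero(&[]), 0);
    println!("ok");
}
```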
Java is forever stuck in my head as a bad language because its UI library performance was so bad. I grew up being annoyed with anything written in Java because of its poor performance on Mac OS X. Notable was the persistent UI lag in anything written in Java.
Java was crippled by a terrible standard UI library and a failure to recognize the opportunity for a natively-compiled systems language other than C++.
JetBrains is doing quite well with that terrible UI library, which I would take any day over Electron crap, and AOT compilers have been available since 2000, even if only in commercial JDKs, which naturally did not do well in communities that won't pay for their work tools.
I think the GP is talking about AWT, which is indeed terrible. JetBrains uses Swing, which is... still not really the best, but IIRC they've done a lot of work to write their own abstractions and helpers and whatnot on top of it to make it more bearable.
Swing is quite powerful; the only downsides are that it requires reading books like "Filthy Rich Clients: Developing Animated and Graphical Effects for Desktop Java Applications" to actually understand how to take advantage of its features, and the bad decision not to use the platform's L&F by default.
> JetBrains uses Swing, which is... still not really the best, but IIRC they've done a lot of work to write their own abstractions and helpers and whatnot on top of it to make it more bearable.
Swing feels pretty okay to me, at least in the times I've used it, especially when IDEs have GUI builders when you just want to do RAD.
I do wonder whatever happened to JavaFX, there was some hype around it years back but it doesn't seem like it got super widespread adoption: https://openjfx.io/
JavaFX was born as a scripting language[0], which most devs didn't like, and then Sun started the process to port it to Java.
In the middle of this, Sun ran into trouble and was acquired by Oracle, and while Oracle took JavaFX development further, they didn't see any value in adding yet another GUI framework to Java, and handed it over to the community as open source around the time Java 11 was released.
A company, Gluon, took over it [1], making its business case the use of JavaFX to also target mobile OSes [3]. They also took over the JavaFX GUI designer [2].
That was mostly ignored: Java's strong points are on the server, Android is its own story with its own frameworks, and so Swing was good enough for the market of desktop applications written in Java.
Additionally, it didn't help that JavaFX is an extra dependency, with binary libraries, which complicates the deployment story.
However, Oracle has recently decided to be a bit more supportive of JavaFX; whether it still matters remains to be seen.
Actually, AWT wasn't awful. It was certainly more performant than Swing; in fact, Swing was the main thing that had me avoiding UI work and eventually dropping Java in favour of .NET.
I thought jetbrains used their own UI layer for their IDE, or am I confusing it with Eclipse there?
indeed, the new official nvidia drivers already don't support the aging Nvidia card I have in my desktop at home (the desktop is newer than the video card; I moved the card over from an older computer).
I had this problem at $JOB-1 where I had a big fat Dell desktop workstation with an nVidia card in it. I don't remember the model but it was a "pro" grade device: Fire or Quadro or some such BS marketing name.
There were lots of old LCD monitors around the office. As soon as I could, I went dual screen. Then I decided to go triple-head.
Problem.
I found a 2nd card, another nVidia, but it was a different GPU generation. The same Nouveau driver can run it, but not the nVidia driver, and you can't have two nVidia drivers installed side-by-side.
I tried multiple nVidia cards from the office spares pile but I couldn't find two the same generation of GPU for ages. When I finally did, the next kernel upgrade nuked the nVidia legacy driver -- it only supports a certain range of kernel versions.
In the end, a colleague took pity and lent me one of his own cards, an old gamer's card with 4 outputs, 3 usable at once. Perfect.
But when he wanted it back, to sell it, it was back into driver hell.
In the end, I managed to get a huge fat double-slot AMD card from the IT department, with 4 outputs, and it Just Worked™ with a FOSS driver.
NVidia driver versions are a massive cluster-fsck and perfectly good working cards are now e-waste because nVidia doesn't maintain its proprietary drivers, doesn't support more than a handful of GPU families in each release, and won't let you install >1 driver at a time.
I am not a gamer. I don't give a rat's ass about 3D performance or CUDA or any of those toys. I just want a shedload of pixels in front of me, updated quickly, with some screens in portrait and some in landscape. (And a desktop that properly supports vertical panels so I can use those screens in any orientation I want with effective use of space.)
Looks like the oldest card nouveau supports is the Riva TNT from 1998. I think it's fine for nouveau to support that and a new driver not to. I didn't get the impression that nouveau is going away any time soon.
When's the last time you compiled the kernel? I bet waiting for Rust drivers to compile on a modern CPU wouldn't take much longer than the last time I compiled the kernel by hand on a Core 2 Duo, circa 2013.
>When's the last time you have compiled the kernel?
2010. I needed to do so for my Linux Kernel university class.
It took 8 hours on this piece of junk Pentium 4 I found in the school's e-waste bin. You know, those black Dell OptiPlex machines that were everywhere like a plague.
They were days of such stress but also so much bliss.
In 2005, I compiled the Linux kernel in 15 minutes on an AMD Duron 700. True, that was a kernel heavily tailored to my architecture, but I think a typical kernel with all modules could take an hour or so.
Not sure how much can really be done wrong; the worst case would probably be an allyesconfig, which is huge, and while it was still big in 2010, it was considerably less so than now.
1. "Nobody compiles a kernel these days". Sure, maybe not for daily use for most, but lots of people do for work and experimentation. Also, Gentoo.
2. "Everybody runs a modern CPU". Plenty of people still run decade-old hardware, and even older, because it works; in that decade the objectives of a PC have not significantly changed, so why would the hardware?
I honestly do not think the Linux Kernel team should worry too much about the very small niche of people that insist on compiling on slow CPUs. Cross compiling exists, and 99% of Linux users run a precompiled vmlinuz binary.
On the other hand, the Rust language team should do their utmost to improve the speed of their compiler.
Sigh. So yet another rewrite, which will result in yet another round of regressions. Hopefully, at the end of it we will have something better than yet another unofficial driver with partial functionality that lags behind in support for newer hardware.
It's a real shame that many things in Linux land are so badly designed/maintained that they have to be re-written from scratch every few years. The major exception being the base kernel itself I suppose.
Which regressions for Nvidia hardware are you talking about?
Nouveau supports seriously ancient Nvidia hardware. It's perfectly reasonable to make a clean break once in a while (both AMD and Intel have done it in their open source drivers).
Even though the newest nvidia drivers have started to support GBM, the Wayland compatibility story is still not great. In my experience on Wayland, several OpenGL programs refuse to work, and Vulkan does not work at all. This is with driver 545.
Ok long story, I'll try to keep it short.
New motherboard, Asus w680 ace ipmi. It has an extra ipmi card.
nvidia-drm.modeset=1 is required for Wayland to render at anything other than 1920×1080.
I have a 4k monitor.
I insert the IPMI card and disable the ASPEED GPU on the card via jumper.
The card is still active. But the BMC isn't working, correct cabling and all.
The ASPEED GPU being active clashes with the Nvidia 4070 card.
I have to disable modeset.
Effectively I can't do Wayland with the IPMI card plugged in.
Bonus fun: I ordered the motherboard via Amazon DE, they imported it from Amazon US, because with import taxes it's over 150€ cheaper than directly from DE.
I contact Asus support (which was hard enough; I can't talk to Asus US support, only DE), and waste about a month doing pointless things, going past Amazon's instant-return window.
They tell me to contact ASUS RMA.
I can only get in touch with ASUS DE RMA.
They tell me to talk to Amazon, because I don't have an invoice, Amazon refuses to give me one because it's an imported product.
I contact Amazon 2 times.
Both times some Indian support people promise me the heavens; I have to tell them, "I can't send the motherboard back, I'm using it and about 2400€ worth of components on it; I use the motherboard for work".
They say, no problem we'll send you a replacement in 3 days.
2 months have now passed, no replacement in sight.
amd uses gbm properly; nvidia's support is half-cooked, and they previously used eglstreams instead because they did not like gbm. the reality is that every compositor just uses gbm.
I've used the prop-drivers for a couple of years now with absolutely zero issues so I'd say so. But I guess your use case might vary. I don't use Wayland.