Any efforts by Nvidia are welcome, but that raises the question of why Red Hat are writing a new driver. Presumably there is some aspect of the OSS component that Red Hat finds unsatisfactory.
It is weird for a 3rd party to be maintaining a 2nd driver when the first party has a reasonable OSS driver.
There are a few things someone would pay Red Hat dearly to do. In my opinion the most likely are:
- No telemetry.
- Enabling software blocked features.
- Emulation.
There are a few possibilities regarding how practicable a performant OSS driver is:
- This is impracticable, because the software you want to run on the GPU is so complex, and the hardware is so complex, that you will never get enough insight to make something that compares with the work of the people who can see it all.
- This is eminently practicable because either (1) the software and the GPU hardware are much simpler than they appear, perhaps with a lot of obfuscation hiding that simplicity, or (2) the application that best utilizes the hardware only needs a limited feature set that is within scope.
I'm leaning toward "application limited scope" and "telemetry." It aligns best with what is actually happening, which is that NVIDIA is scooping up a lot of valuable intelligence on LLM workloads, and that there isn't enough competition for LLM "ASICs" to make them cheap enough to be worthwhile.
I never understood all the hate we in the OSS world have for telemetry. Why? I mean, if Nvidia and Red Hat paid for this to be developed, wouldn’t it help them continue to develop it by knowing how, where and how often it was used?
Apple, which collects a lot of telemetry, made privacy a tenet of their brand, and they've successfully conflated the two in the mind of a layperson.
Separately, most telemetry would tell stories like, "This project is a failure." Little incentive for people to adopt it. Then again, most OSS is ordered by the mania of its programmer-creators, not product or engineering quality informed by telemetry. Maybe in a Darwinian way, we only have the OSS that can thrive without telemetry and reactive product and engineering decisions.
Debian, Red Hat and Firefox do have telemetry. Not a lot, but enough to prove it is possible to do it in a way that doesn't piss most people off.
Naturally the way they collect it is open source, it's largely de-identified, and because it's open source you can verify it's de-identified enough for you. And if it isn't, you can turn it off.
So it's possible to do well, where "well" means it gets you the data you need without pissing off the users. Most proprietary vendors don't bother to do it well, for whatever reason.
But they should be careful: it's a big factor in driving things like Home Assistant, Linux Desktop and now this, apparently.
Because it's never just for engineering purposes anymore. Marketing departments are CRAZY about consumer data of any kind and you can be assured the value of such collected data will be maximized on the information exchange market. Advertisers pay good money for trustable data streams, which can be essentially de-anonymized with enough cross correlation between sources. A GPU happens to be a good place to capture user activity, since from the screen buffer you can tell what website they visit, what videos they watch and what games they play.
This stuff is kept rather hush-hush so as not to scare people, but it's also well documented and certainly not an unfounded conspiracy. It's also why Europeans have adopted far-reaching privacy laws (GDPR): their industries don't rely on consumer surveillance the way American ones have come to over the last few decades.
The telemetry aspect is easily resolved by any enterprise that cares enough about it.
The conclusion that this is more to do with the overall architecture and deployment of the existing driver is much more plausible.
> The telemetry aspect is easily resolved by any enterprise that cares enough about it.
Yeah, and this is how: by writing your own driver. NVIDIA sells you turnkey DGX machines. It doesn't give you firmware. You have to be Internet-connected to refresh your various licenses at some point, which is the moment the telemetry is shared. Google "NVIDIA telemetry."
If you are using NVIDIA on the cloud, well, all bets are off. You are using their drivers. Amazon can't force you, in your VM, to install a different driver for the GPU you are using, because there's no alternative to the proprietary one. Hence my theory for why Red Hat could be paid to do this.
Telemetry is something that NVIDIA doesn't budge on for enterprises. You're welcome to see for yourself and start a sales call. I hear they're pretty busy.
> but that raises the question of why Red Hat are writing a new driver
Nvidia's driver cannot be included in upstream Linux because it doesn't follow the kernel's coding style and code organization. More importantly, it is tightly coupled to a single version of their GSP firmware: the driver and the firmware have to be updated at the same time.
> It is weird for a 3rd party to be maintaining a 2nd driver when the first party has a reasonable OSS driver.