Open source AI: Red Hat's point-of-view

nine_k · 2025-02-07T03:12:44 1738897964

To me, the ML situation looks roughly like this.

(1) Model weights are something like a bytecode blob. You can run it in a conformant interpreter, and be able to do inference.

(2) Things like llama.cpp are the "bytecode interpreter" part, something that can load the weights and run inference.

(3) The training setup is like a custom "compiler" which turns training data to the "bytecode" of the model weights.

(4) The actual training data is like the "source code" for the model, the input of the training "compiler".

Currently (2) is well-served by a number of open-source offerings. (1) is what is usually released when a new model is released. (1) + (2) give the ability to run inference independently.

AFAICT, Red Hat suggests that an "open-source ML model" must include (1), (2), and (3), so that the way the model has been trained is also open and reusable. I would say that it's great for scientific / applied progress, but I don't think it's "open source" proper. You get a binary blob and a compiler that can produce it and patch it, but you can't reproduce it the way the authors did.

Releasing the training set, the (4), to my mind, would be crucial for the model to be actually "open source" in the way an open-source C program is.

I understand that the training set is massive, may contain a lot of data that can't be easily released publicly but that were licensed for the training purposes, and that training from scratch may cost millions, so releasing the (4) is very often infeasible.

I still think than (1) + (2) + (3) should not be called "open-source", because the source is not open. We need a different term, like "open structure" or something. It's definitely more open than something that's only available via an API, or as just weights, but not completely open.

pavelstoev · 2025-02-07T03:41:14 1738899674

It is really just “open use” with detailed defined by the license type (MIT, etc)

nine_k · 2025-02-07T03:49:28 1738900168

It's more than just use (inference), it does open some otherwise secret sauce of the training. It looks like there's no existing word / notion to exactly pinpoint this level of openness.

visarga · 2025-02-07T05:06:47 1738904807

> Model weights are something like a bytecode blob

Can you update a bytecode blob as easily as finetuning and prompting models? It only takes a few input-output pairs and a few dollars worth of compute. They are more like an operating system and fine-tuning/prompting is like scripting on top. Similarly with Linux, you can download a LLM and run it locally.

anon373839 · 2025-02-07T05:06:31 1738904791

I think these endless debates about whether open-weights models qualify for a particular piece of terminology are... tiring. That said, I think the debates would benefit from discussing model training and model inference as two separate systems, because that's what they are. It's possible for model training to be closed-source while model inference is open-source, and vice versa.

Consider recent Mistral-Small release. The model training is almost totally closed-source. You can't replicate it. However, the model inference is fully open source: the code and weights are Apache licensed. Not only that, but Mistral released both the base model and the instruction-tuned model, so you have a good foundation to work from (the base model) should you prefer to do your own instruction tuning. In fact, Mistral has also open-sourced code to aid in the fine-tuning process as well. So you really have everything you need* to use and customize this inference system. And for most practical purposes, even if you had the original training data, it would be of no use to you.

It's also worth considering the inverse scenario. Suppose Meta were to release a big blob of pre-training data and scripts for Llama 405B, but no weights. This clearly qualifies as open source, but it is basically useless unless you have many millions of dollars to do something with it. It would do very little to democratize access to AI.

* Asterisk: There is one situation where having access to the original training data would be really, really useful -- model distillation. Nobody can match Meta's ability to distill Llama 405B into an 8B size, because that process works best when you can do it on identically distributed data.

pabs3 · 2025-02-07T05:39:01 1738906741

For me, the attacks on ML that are possibly by poisoning the training data preclude considering models without freely distributable and modifiable training as open-source or libre models.

pabs3 · 2025-02-07T02:48:25 1738896505

I prefer the ML policy of the Debian Deep Learning Team.

https://salsa.debian.org/deeplearning-team/ml-policy/

jmclnx · 2025-02-07T00:07:33 1738886853

No real information, just a marketing spiel.

echelon · 2025-02-07T02:24:21 1738895061

> TL;DR - Red Hat views the minimum criteria for open source AI as open source-licensed model weights combined with open source software components.

That's not open source!

I'm going to call this "Wizard of Ozzing". You give away the spectacle of magic tricks, but none of the science and machinery to do it yourself. You're still hiding it all behind fake virtue signalling.

Open source in AI is open weights, open training scripts, open inference scripts, open training datasets, and associated helper utilities. Without the science lab, you cannot replicate the science.

Weights in a vacuum are not open. It's a trick.

tbrownaw · 2025-02-07T02:32:44 1738895564

You have the preferred form for making modifications, and the relevant permissions to now get in trouble for it.

If you actually go look at the open source or few software definitions, that's what they're about - being able to make modifications.

Just like an open source software project doesn't need a public record of the rationale for all architectural decisions in order to qualify.

nine_k · 2025-02-07T02:48:08 1738896488

Open source means that you can build from said source.

You can e.g. give away a closed-source game engine with an editor, where you can modify the prebuilt levels, and create your own, to your heart's content. But you can't build it from scratch in a controlled environment, and can't audit it. You also can't do modifications where the level editor interface is not sufficient, e.g. in the renderer. That's not open source, that's freeware.

For ML models, the training set is a crucial part of their source.

superice · 2025-02-07T04:00:09 1738900809

Open source means being able to verify how the sausage is made. Getting a premade sausage and saying “oh you can still eat it and spice it up however you like” isn’t that.

I’m happy open weights exist, but it is not truly open source.

echelon · 2025-02-07T04:54:45 1738904085

Define these terms:

- Pretrain

- Fine tune

- Catastrophic forgetting

- Learning rate

- Adapter layer

If you understand these terms, then you understand why "open weights" are not the same as "open source".

"Open weights" are obfuscated and minified freeware.

TZubiri · 2025-02-07T02:47:29 1738896449

Red Hat is on your team and you are criticizing them for not doing enough, why not side with red hat against companies that don't even publish shit and still consider there models to be open?

The prevailing strategy for most companies is to publish some bullshit (like a cli or a model downloader) and call it open source, the bar is waaay low and we are just trying to raise it a couple of inches, it's not helpful to go for the throat and demand that it be raised to the sky!

echelon · 2025-02-07T03:03:34 1738897414

The term "open" in AI is being muddied and redefined. We shouldn't stand for that in any of its manifestations, lest we find ourselves in a world where "open" means completely dependent upon the giant foundation model companies.

This definition of "open source" for AI must not be allowed to take hold. It's pernicious.

Weights are a compiled binary. Encumbered freeware.

ysofunny · 2025-02-07T00:20:11 1738887611

i mean... it's IBM so what did we really expect?

globalnode · 2025-02-07T01:24:27 1738891467

I was thinking of trying Fedora (currently using Debian) and this comment made me look up who owns red-hat. ibm now owns red-hat, and apparently vanguard owns a huge chunk of ibm. I wonder how much influence any of the sponsors have over what goes into the os and what direction it takes.

rat87 · 2025-02-07T01:33:44 1738892024

Vanguard "owns" a huge chunk of everything. Vanguard runs index funds, most peoples retirements are vanguard buying small shares of an index fund representing all the big companies on the stock exchange

TZubiri · 2025-02-07T02:49:06 1738896546

Also they simultaneously hold the ownership rights as well as equivalent owenrship liabilities, so they own shit squat in net terms (excpet maybe their management fees).

mattkrause · 2025-02-07T03:05:47 1738897547

It’s even weirder than that!

Vanguard has an odd corporate structure where it’s owned by the funds that it manages, so it’s effectively a co-op owned by its customers.

UniverseHacker · 2025-02-07T04:04:05 1738901045

I don’t understand why customer owned co-ops aren’t ubiquitous. Vanguard is amazing- low fees, and great services- they beat all of the competition. I had to call their support line today and it was the most professional customer service I’ve ever experienced.

TZubiri · 2025-02-07T21:56:22 1738965382

Interesting, that's inline with Bogle's mission for low (0%) mgmt fees.

I think it probably doesn't apply to other majority holders like BlackRock though.

fny · 2025-02-07T04:08:54 1738901334

Vanguard is everyone’s retirement account. Everytime someone buys their S&P 500 ETF, they go buy stocks.

TZubiri · 2025-02-07T02:40:30 1738896030

The vanguard thing is a common misconception, I've worked for S&P and we did tracking of ownership.

Think of these huge funds as proxies. It's like someone with little finance trading saying most of the internet is owned by cloudflare or RIPE or ARIN.

lrvick · 2025-02-07T04:25:30 1738902330

Fedora is centrally built/signed and not part of the reproducible builds project. It should not be used for any systems you need to be able to trust.

You are much better off sticking with Debian anyway, or looking at Guix for a significant improvement.

dismalaf · 2025-02-07T05:12:50 1738905170

> vanguard owns a huge chunk of ibm

Vanguard and other large institutions own a huge chunk of everything because most investors don't buy stocks directly, they buy them through mutual funds, ETFs, etc...

hedora · 2025-02-07T02:20:18 1738894818

You probably missed the news that RedHat decided to systematically violate the GPL:

https://sfconservancy.org/blog/2023/jun/23/rhel-gpl-analysis...

tl;dr: They no longer publish source packages except to paying customers. If the paying customers republish the source, then RedHat closes the account.

I'm surprised none of the upstream developers have sued them for violating the license. There's no way I'd trust an organization that was behaving so unethically with control over my machine's package manager.

Anyway, I've been pretty happy with Devuan (Debian without SystemD). I find it much more stable than Ubuntu, Arch, Debian, etc., and all the userspace stuff I care about works great.

dralley · 2025-02-07T02:55:20 1738896920

This is wrong. Everything in RHEL is downstream from CentOS Stream - all of the sources are published there. The only differences are a handful of trademarks.

>If the paying customers republish the source, then RedHat closes the account.

Even if you ignore the above and think only about the official sources provided direct from the customer portal, it's still not a violation IMO.

Because that's not a restriction on how you can use the software you've been provided, it's a restriction on which services you can expect Red Hat to continue providing you, i.e. providing new software in the form of updates. The software you have already been provided continues functioning, it's not like the system gets bricked if your account is closed. GPL only specifies "what are you allowed you do with this piece of software you have been provided with", it doesn't guarantee a future relationship between the provider and receiver.

At worst it's a murky area, not a "systematic violation" as you claimed.

Also, like, it's a hypothetical thing the user agreement claims could be done, not something that necessarily is done. I don't think there has ever been an actual demonstrated instance of an account being closed because of that.

pabs3 · 2025-02-07T03:01:11 1738897271

> Everything in RHEL is downstream from CentOS Stream - all of the sources are published there.

IIRC thats incorrect, RHEL gets some fixes before CentOS Stream.

dralley · 2025-02-07T03:03:29 1738897409

They may get published to RHEL first in the case of embargo'd security fixes (and not by long), but the point is that the sources are still published to CentOS Stream.

tbrownaw · 2025-02-07T02:39:36 1738895976

> I'm surprised none of the upstream developers have sued them for violating the license.

Sounds like that implies that they probably actually aren't.

globalnode · 2025-02-07T02:38:46 1738895926

Cheers, will check it out.

TZubiri · 2025-02-07T02:38:45 1738895925

Or you don't understand the matter being discussed.

The classical definition of open source (cited in this release as Stallman's GPL definition of the preferred way in which to modify code) kind of breaks for ML programs.

This is a good update on the definition of open source from a quite reputable and influential FOSS source.

Your presumption that this article isn't sufficiently pure because in addition to being a company that does Free Software, it needs to do so without doing anything that resembles an ad or in any way ensures profitability is pedantic. If we had it your way the only open source software we would have would be mom's basement projects with shoestring budgets.

lutusp · 2025-02-07T03:44:52 1738899892

> More than three decades ago, Red Hat saw the potential of how open source development and licenses can create better software to fuel IT innovation. Thirty-million lines of code later, Linux not only developed to become the most successful open source software but the most successful software to date.

This seems to conflate Red Hat and Linux, as well as try to equate Red Hat with open-source. Red Hat is Linux, but Linux is not Red Hat, especially now that Red Hat has decided to restrict access to the RHEL source (https://www.itworldcanada.com/article/red-hat-decision-turns...).

And a pet grammatical peeve of mine:

> ... in some respects they serve a similar function to code.

I see this everywhere now -- IMHO it should be "... serve a function similar to code." Doesn't the original grate on your ear?

Also this is a Turing-test bot detector -- bots don't use this weird grammatical construction, only humans do.

mogwire · 2025-02-09T18:24:31 1739125471

Restrict access to paying customers? Restrict access to companies violating an EULA to not redistribute packages?

That fact that people continue to spread this trope is amazing.

I pay for RHEL and I have a developer subscription for personal usage and the SRPMs are right there on their download portal.

Just because CIQ err Rocky has to take extra steps and violate Red Hat’s EULA doesn’t mean they restricted access.

nickandbro · 2025-02-07T01:58:04 1738893484

Congrats NeuralMagic team on being acquired! I don't know if you know this, but I worked with you on discord a few times. Your team's always willing to go above and beyond with pushing out popular models in specific quant formats compatible with vLLM. And one of the few huggingface orgs that my boss can actually trust. Well deserved!

blackeyeblitzar · 2025-02-03T22:48:01 1738622881

Disappointing that red hat is basically validating open weights as open source, and excusing it by saying this:

> The majority of improvements and enhancements to AI models now taking place in the community do not involve access to or manipulation of the original training data. Rather, they are the result of modifications to model weights or a process of fine tuning which can also serve to adjust model performance.

Well yes, because they have no access to anything more. With training source code and data they might do something different. If you don’t have all the things used to produce the final result, it’s not open source.

bberenberg · 2025-02-06T23:38:07 1738885087

Do you believe that open source can exist on top of closed hardware? I ask because you can't produce the final result without having someone give you the firmware blob. To me, this seems like an analogue to building on top of open weight models.

PollardsRho · 2025-02-07T00:17:56 1738887476

The math underpinning an AI model exists independent of the hardware it's realized on. I can train a model on one GPU and someone else can replicate my results with a different GPU running different drivers, down to small numerical differences that should hopefully not have major effects.

Data isn't fungible in the same way: I can't just replace one dataset with another for research where the data generation and curation is the primary novel contribution and expect to replicate the results.

There's also a larger accountability picture: just like scientific papers that don't publish data are inherently harder to check for statistical errors or outright fraud, there's a lot of uncomfortable trust required for open-weight closed-data models. How much contamination is there for the major AI benchmarks? How much copyrighted data was used? How can we be sure that the training process was conducted as the authors say, whether from malfeasance or simple mistakes?

twelve40 · 2025-02-06T23:54:35 1738886075

i have very little knowledge of any of this, but i had an impression that OpenAI was trained on commodity cloud hardware that's available for purchase/rent to anyone, including off-the-shelf GPUs from Nvidia and AMD? are those what you are referring to as "the firmware blob", or was there some other, more specialized and custom-built closed hardware involved?

jlouis · 2025-02-06T23:49:54 1738885794

Turing completeness makes it a different problem.

TZubiri · 2025-02-07T02:50:34 1738896634

"Do you believe that open source can exist on top of closed hardware? "

Yes, if Hardware is developed against standards shared by multiple manufacturers like amd64

dralley · 2025-02-07T00:35:46 1738888546

It's not exactly practical to hand out the training material given the sheer quantity of data we're talking about.

andrewf · 2025-02-07T03:12:27 1738897947

GPL v2 and earlier let you charge distribution costs (v3's language is more complicated). In the late 80s you could order an Emacs tape from the FSF for $150, which is about $430 today!

philipkglass · 2025-02-07T00:45:24 1738889124

But they could provide training code and let people provide their own Common Crawl (or whatever other pile of training data), couldn't they?

stonogo · 2025-02-07T00:41:04 1738888864

Yeah, no. We can move an arbitrary amount of data around the world at breakneck speed. Netflix does this for a living. It's not practical to hand out the training material because of the massive rampant copyright violations.

mossTechnician · 2025-02-07T01:14:30 1738890870

If a research group downloads material in order to train a model, is there some significant difference in copyright violation if they hand it to a second research group in order to fulfill the same purposes?

Onawa · 2025-02-07T01:21:42 1738891302

Yes, because of a key word in a lot of copyright laws... "distribution". Using that copyrighted material themselves to train the model still gives them plausible deniability. Handing the copyrighted material to another group starts to run afoul of other laws and also removes the plausible deniability that the original group can claim regarding their training data.

TZubiri · 2025-02-07T02:44:38 1738896278

The training data is not necessarily kept. It's possible that data is consumed, incorporated into the weights and then discarded.

llm_trw · 2025-02-07T01:03:12 1738890192

If only we'd figured out a technology that let us move huge torrents of bits around.

If only there was a catchy name for it. Something like bit-torrent perhaps?

TZubiri · 2025-02-07T02:43:38 1738896218

I understood that it meant source code in addition to weights, as in "publishing programming language code does not suffice as open source if you do not publish weights"

thrqka · 2025-02-06T23:49:29 1738885769

> We believe that these concepts can have the same impact on artificial intelligence

Where the concept is the exploitation of thousands of volunteers while repackaging their work. (I know that RedHat sponsors some people, sometimes to the detriment of projects, but a lot of it is not sponsored, especially when RedHat established itself.)

carlwgeorge · 2025-02-07T04:53:59 1738904039

Red Hat pays more people to work on open source than any other company I'm aware of. I am one of these people. I challenge you to find a single open source project included in a Red Hat product that doesn't contain contributions from Red Hat employees. Maybe a few exist, but the vast majority include Red Hat contributions, because we contribute all over the open source ecosystem.

dralley · 2025-02-07T00:34:04 1738888444

So, what, is HN against open source now?

7qW24A · 2025-02-07T00:41:31 1738888891

More accurate to say that VC and the “startup ideology” has always been at the core of HN - it just so happened that aligned with OSS ideology during the ZIRP era.

lrvick · 2025-02-07T04:30:02 1738902602

I largely agree with these points, however it is an awkward position coming from Red Hat which is the best funded Linux distribution there is, and -still- not part of the reproducible builds project or investing in full source bootstrapping which means no one can exactly reproduce their published artifacts from source or prove they were not tampered with. (Same with Fedora)

Glass houses.

dralley · 2025-02-07T04:36:20 1738902980

> (Same with Fedora)

??

https://docs.fedoraproject.org/en-US/reproducible-builds/

https://pagure.io/fedora-reproducible-builds/project/issues

https://fedoraproject.org/wiki/Releases/41/ChangeSet#Reprodu...

lrvick · 2025-02-07T07:41:28 1738914088

From that first link "In the Fedora ecosystem, we cannot achieve reproducibility by the reproducible-builds.org definition"

Good to see they are slowly closing some blockers every year or so, but fundamentally today they do builds and signing centrally. There is no way to readily get the same hash of a central fedora supplied rpm locally.

Supply chain integrity is simply not a priority. They just trust the central build farm, or the compilers it uses, or everyone with access to it will never be compromised.

dralley · 2025-02-07T14:05:55 1738937155

This is a touch dramatic. The hash of the payload and the hash of the RPM header are still reproducible and can be verified. It's just that the existence of internal signatures makes it impossible to do a simple checksum of the file.

lrvick · 2025-02-07T15:17:28 1738941448

And thus RPM was not designed with easy user reproduction and signing by multiple independent parties for high accountability in mind. Most other package managers do not have this problem. This is a flaw that should be corrected.

Also, it takes a ton of work and testing and bug fixes and patches to get software reproducible. Assume most packages are not reproducible until proven otherwise. Arch, debian, nix, guix, all do that work and publish the proof, for several years, with far less resources than redhat or fedora. Stagex even has 100% (shameless plug)

Easy user hash for hash reproducibility with published reproduction testing proofs is the standard baseline for years now, and even that is nowhere near good enough.

Multiple independently signed reproduction proofs with full source bootstrapping is IMO a bare minimum for any distro that expects other people to be able to trust it for more than hobby use cases.

Supply chain attacks are becoming very common, and no one should have to trust a single engineer somewhere with a god signing key for a major distro.

Also just to spot check a popular package in Fedora, rust, I just confirmed it still downloads a non-reproducible binary rust compiler to build its own rust package, so it is certainly not reproducible from source even putting aside the rpm signing format problems. Fedora blindly trusts whoever builds the binaries on the rust team. I can only assume RHEL does the same.

https://src.fedoraproject.org/rpms/rust/blob/8e04e725bbf4eb9...

dralley · 2025-02-07T18:47:19 1738954039

It's a problem that can be easily fixed with tooling that's smart enough to just look inside the file. Detached signatures aren't necessarily better, just different.

davydm · 2025-02-07T04:27:30 1738902450

Red Hat opining about what is and isn't open is absolutely hilarious. Sorry, my dudes, you launched that ship in the wrong direction ages ago.

worthless-trash · 2025-02-07T04:29:02 1738902542

What does redhat ship that isnt open source under the licenses my dude ?

sciencesama · 2025-02-07T03:02:59 1738897379

Look whos talking !! The whole fiasco with fedora and now they come to talk about opensource !!