I’m guessing one of the big Chinese firms will outbid everyone. They’re obviously trying to catch up with OpenAI, and throwing whatever money is necessary at this makes sense as a result.
That’s ok. It’s the expected result in a market economy.
America is throwing away its chances of making open source LLMs. Copyright holders are demanding license fees for training. That’s analogous to someone demanding licensing fees before you can make a YouTube video: no one would be able to do it for free. Whereas it was completely possible to train a high-grade LLM for free (clusters are surprisingly accessible to researchers), and it wasn’t until recently that you had to worry about being sued for it.
Net result: open source LLMs die, except for companies that can open source their smaller (lamer) models as an upsell for the real ones that anybody cares about. That’s not a world where open source makes a big impact. That’s a world where (metaphorically) GPL software is subservient to business interests for the rest of eternity. Say what you will about whether that’s a fair comparison, but no business has influence over Emacs, and it’s fantastic, powerful software. No one will be able to make the equivalent open source fantastic LLM in America at this rate.
A bunch of factors. One is that you’d have to keep your identity private, but credibility is how you get access to resources. And resources are necessary to train anything.
Take TRC for example. They give people access to TPUs in exchange for being cited. But if they were cited as facilitating large scale piracy, they probably wouldn’t be happy. It could even lead to a lawsuit on the grounds of facilitating copyright infringement, which will likely be the charge against me if someone gets mad enough to sue me directly. (I never distributed anything, but that doesn’t matter if they can prove facilitation.) And TRC is an even juicier target for lawsuits since Google is a giant loot box of money for them.
The coordination problems faced by all illegal entities.
Piracy generally means taking a full-fledged product of someone else’s and using it. In your description, it seems you’re taking the data illegally and then doing all the compute on it yourself. The amount of compute needed in this case is staggering, so you’re back to coordinating with other people. Any one of those individuals turning against the group would likely doom the group, hence it’s a high-risk operation.
This is something I've been thinking about: What kind of compute would it take to train an LLM on, let's say, a torrent with 100GB of books from Anna's Archive?
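A rough back-of-envelope sketch, not a measurement: using the common ~6 × params × tokens FLOPs rule of thumb for one training pass, assuming ~4 bytes of raw text per token (a rough average for English) and an A100 at ~312 TFLOP/s peak bf16 with ~40% realized utilization. All of these numbers are assumptions; real runs vary widely.

```python
# Back-of-envelope training-compute estimate for ~100GB of books.
# Assumptions (not measurements):
#   - ~6 * params * tokens FLOPs for one epoch (common rule of thumb)
#   - ~4 bytes of raw text per token
#   - A100 at ~312 TFLOP/s peak bf16, ~40% utilization in practice

def training_flops(params, tokens):
    """Approximate total FLOPs for one training epoch."""
    return 6 * params * tokens

dataset_bytes = 100e9        # 100 GB of books
tokens = dataset_bytes / 4   # ~25B tokens

effective_flops_per_sec = 312e12 * 0.4  # effective A100 throughput

for params in (125e6, 1.3e9, 7e9):
    flops = training_flops(params, tokens)
    gpu_hours = flops / effective_flops_per_sec / 3600
    print(f"{params/1e9:>5.2f}B params: {flops:.2e} FLOPs, ~{gpu_hours:,.0f} A100-hours")
```

Under these assumptions a ~1B-parameter model over 100GB of text lands in the hundreds of A100-hours, which is a few days on a modest multi-GPU node, while a 7B model is in the low thousands. That's well within reach of a small academic allocation, which is the point being made above.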
i'm not sure i follow. you're saying that no company (except the small ones) will be able to open source an LLM. but then you cite software created outside of the traditional company structure as an example of what we'll never have in the LLM space. doesn't your example negate the premise?
Not at all. Maybe it’s lost to time, but most of the important models were created by academics, not companies, up till recently. GPT-J for example was trained by one person acting alone (Ben Wang).
I fine-tuned GPT-2 1.5B on chess games. AI Dungeon fine-tuned on fantasy novels. All of this type of work will become impossible with the specter of lawsuits hovering overhead.
EDIT: also, most impactful older models were by one person (e.g. YOLO). What I like about ML is that lone wolves can have a big impact.