> Like there must be a size too small to hold "all the information" in.
We're already there. If you run Mistral-Large-2411 and Mistral-Small-2409 locally, you'll find the larger model is able to recall more specific details about works of fiction. And Deepseek-R1 is aware of a lot more.
Then you ask one of the Qwen2.5 coding models, and it won't even be aware of those works, because they're:
> small, specialized models.
> But maybe training compute will just get to the point where we can run a full-featured model on our desktop (or phone)?
Training-time compute won't allow the model to do anything out of distribution. You can test this yourself if you run one of the "R1 Distill" models. E.g., if you run the Qwen R1 distill and ask it about niche fiction, no matter how long you let it <think>, it can't tell you something the original Qwen didn't know.
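If you want to try this yourself, here's roughly what I mean -- a minimal sketch using the ollama Python client, assuming you've pulled a base Qwen tag and the Qwen-based R1 distill locally (the tags and the question are just placeholders):

    # Ask the same niche-fiction question to the base model and the R1 distill.
    # Model tags are examples -- "deepseek-r1:7b" is the Qwen-based distill on ollama.
    import ollama

    QUESTION = "Describe the plot of <some obscure novel you know well>."

    for model in ("qwen2.5:7b", "deepseek-r1:7b"):
        reply = ollama.chat(model=model, messages=[{"role": "user", "content": QUESTION}])
        print(f"--- {model} ---")
        print(reply["message"]["content"])

The distill will happily <think> for a while, but the final answer won't contain facts the base Qwen didn't already have.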
I suppose we could eventually get to a super-MoE architecture: each expert model limited to 4-16GB in size, but with hundreds of them covering various topics, loaded from storage into RAM and unloaded as needed. Loading any 4-16GB model should only take a few seconds. Maybe pair that with a 4GB "Resident LLM" that is always loaded and figures out which expert to load, roughly like the sketch below.
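Roughly the dispatch loop I'm picturing, sketched on top of llama-cpp-python -- the file names, topic list, and routing prompt are all made up, it's just to show the resident-router idea:

    # Hypothetical "resident router + on-demand experts" loop.
    from llama_cpp import Llama

    EXPERT_FILES = {
        "coding":  "experts/qwen2.5-coder-14b-q4.gguf",
        "fiction": "experts/fiction-expert-8b-q4.gguf",
        "general": "experts/general-expert-8b-q4.gguf",
    }

    # Small model that stays resident and only routes prompts to an expert.
    router = Llama(model_path="resident-4b-q4.gguf", n_ctx=2048)
    current_topic, expert = None, None

    def answer(prompt: str) -> str:
        global current_topic, expert
        # Ask the resident model to pick a topic; crude, but it shows the idea.
        choice = router(
            f"Pick one of {sorted(EXPERT_FILES)} for this request, reply with one word:\n{prompt}",
            max_tokens=8,
        )["choices"][0]["text"].strip().lower()
        topic = choice if choice in EXPERT_FILES else "general"
        if topic != current_topic:
            expert = None  # drop the old expert so its few GB of RAM are freed
            expert = Llama(model_path=EXPERT_FILES[topic], n_ctx=4096)
            current_topic = topic
        return expert(prompt, max_tokens=512)["choices"][0]["text"]

The few seconds of load time would only be paid when the topic actually changes.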
> We're already there. If you run Mistral-Large-2411 and Mistral-Small-2409 locally, you'll find the larger model is able to recall more specific details about works of fiction.
Oh, for sure. I guess what I'm wondering is whether we know the Small model (in this case) is too small -- or whether we just haven't figured out how to train it well enough?
Like, have we hit the limit already -- or, in (say) a year, will the Small model be able to recall everything the Big model does (say, as of today)?