24 gigabytes is more than enough to run a local LLM for a small household or business.
This is "gaming PC" territory, not "space heater". I mean, people already have PS5s and whatnot in their homes.
The hundreds-of-gigabytes requirement exists because the big cloud LLM providers went down the ever-increasing-parameter-count path. That path is a dead end, and we've already reached negative returns.
Prompt engineering + finetunes is the future, but you need developer brains for that, not TFLOPs.
It depends on 1) what model you are running; and 2) how many models you are running.
You can just about run a 32B model (at Q4/Q5 quantization) on 24GB. Running anything larger (such as the increasingly common 70B models, or the really big ones like Llama 4 or DeepSeek) means splitting the model between VRAM and system RAM. But yes, anything 24B or lower you can run comfortably, with enough capacity left over for the context.
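For a rough sense of where the 24GB goes, here's a back-of-envelope sketch in Python. The bits-per-weight and architecture numbers (layers, KV heads, head dim) are illustrative assumptions, not exact figures for any particular model or runtime:

    # Back-of-envelope VRAM estimate: weights + KV cache.
    # All numbers are approximations; real usage varies by runtime and architecture.

    def weights_gb(params_b: float, bits_per_weight: float) -> float:
        # params_b billion parameters at bits_per_weight bits each
        return params_b * 1e9 * bits_per_weight / 8 / 1e9

    def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                    context_tokens: int, bytes_per_value: int = 2) -> float:
        # 2x for keys and values, stored at bytes_per_value bytes (fp16 = 2)
        return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

    # Example: a 32B-class model at ~4.5 bits/weight (Q4/Q5-ish), with
    # hypothetical architecture numbers: 64 layers, 8 KV heads, head dim 128.
    w = weights_gb(32, 4.5)             # ~18 GB of weights
    kv = kv_cache_gb(64, 8, 128, 8192)  # ~2.1 GB for an 8k context
    print(f"weights ~{w:.1f} GB, KV cache ~{kv:.1f} GB, total ~{w + kv:.1f} GB")

That lands around 20 GB, i.e. it "just about" fits in 24GB with a little headroom for the runtime itself.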
If you have other models -- text-to-speech, speech recognition, etc. -- those take up VRAM too, both for the model weights and during processing/generation. That affects the size of LLM you can run.
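If you want everything resident at once it's basically a budgeting exercise. A trivial sketch with made-up per-model sizes (placeholders, not measurements):

    # Hypothetical VRAM budget when running several models side by side.
    budget_gb = 24.0
    models_gb = {
        "llm_32b_q4": 20.0,         # weights + KV cache from the estimate above
        "speech_recognition": 1.5,  # illustrative
        "text_to_speech": 1.0,      # illustrative
    }
    used = sum(models_gb.values())
    print(f"used ~{used:.1f} of {budget_gb:.0f} GB, headroom ~{budget_gb - used:.1f} GB")
    if used > budget_gb:
        print("over budget: use a smaller quant or offload something to CPU")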