Using an M1/M2/M3 Max for LLM inference isn't at all "speculative": it's a real thing today, and high-end Apple Silicon as an option for LLM inference is becoming common knowledge in the local inference community. The original author of llama.cpp (one of the leading LLM inference projects) developed it on a Mac, and it has full Metal acceleration support.
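To give a sense of what that looks like in practice, here's a minimal sketch using the llama-cpp-python bindings (the model path is a placeholder; on Apple Silicon the package is typically built with Metal enabled, and n_gpu_layers=-1 offloads all layers to the GPU):

    # Minimal local-inference sketch with llama-cpp-python (pip install llama-cpp-python).
    from llama_cpp import Llama

    # Placeholder path: any quantized GGUF model downloaded from Hugging Face works here.
    llm = Llama(
        model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",
        n_gpu_layers=-1,   # offload every layer to the GPU (Metal on Apple Silicon)
        n_ctx=4096,        # context window size
    )

    out = llm("Q: What is unified memory? A:", max_tokens=128)
    print(out["choices"][0]["text"])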
The $20/month subscription gives you access to the commercial models, but you generally have to run the open-weight models yourself. With the unified RAM you can trivially run the larger 70B+ models.
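The memory math is roughly as follows (a back-of-envelope sketch, assuming a 4-bit quantization at about 4.5 bits per weight including quantization overhead):

    # Rough check: does a quantized 70B model fit in unified memory?
    params = 70e9            # 70B parameters
    bits_per_weight = 4.5    # approx. for a 4-bit quant, incl. scales/metadata overhead
    weights_gb = params * bits_per_weight / 8 / 1e9
    print(f"~{weights_gb:.0f} GB for weights")
    # ~39 GB for the weights, plus a few GB for the KV cache,
    # which fits comfortably in a 64-128 GB unified-memory Mac.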
AI researchers generally have to use CUDA, since the ecosystem is still mostly CUDA-only for training and fine-tuning, but those who only occasionally need to run custom/local models for inference will likely find high-end Macs a good fit for their use cases.
Okay, but this is rather low-level. Most users aren’t going to care about “running an LLM.” They want to ask an AI chatbot questions or get code autocompletion or something like that. What applications are there that need a local LLM and who is using them? What do they do with them?
Personally, I’m reasonably happy with GPT-4 and GitHub Copilot, and I’ve sometimes used Midjourney, though I cancelled my subscription since I’m not currently generating any images. Are there important apps that I’m missing?
If you're happy with the commercial offerings, there isn't a very compelling reason to use local models; the local ones are not "better" in general. But sometimes people have reasons to use local models, e.g. privacy, customization, control, etc.
I personally use it mostly to keep tabs on the latest models released on Hugging Face. There have been a lot of interesting developments since last year, and models have become more and more powerful.