There are many workflows, each with hardware-dependent requirements. Three that work on my MacBook:
1. Clone & make llama.cpp. It's a CLI program that runs models, e.g. `./main -m <local-model-file.gguf> -p <prompt>` (see the sketch after this list).
2. Another CLI option is `ollama`, which downloads and caches models for you on first run.
3. A GUI like LM Studio provides a wonderful interface for configuring and interacting with your models, and it includes a model catalog to pick from.
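To make options 1 & 2 concrete, here's a minimal sketch of the terminal flow. The model path and prompts are placeholders, and `./main` is llama.cpp's classic binary name (newer builds may name it differently):

```sh
# Option 1: clone & make llama.cpp (assumes git, make, and a C/C++ toolchain)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Run a local GGUF model: -m model file, -p prompt, -n max tokens to generate
./main -m ./models/your-model.Q4_K_M.gguf -p "Why is the sky blue?" -n 256

# Option 2: ollama pulls and caches the model the first time you run it
ollama run mistral "Why is the sky blue?"
```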
Assuming that your hardware is sufficient, options 1 & 2 should satisfy your terminal needs. Option 3 is an excellent playground for trying new models/configurations/etc.
Models are heavy. To fit one in your silicon and run it quickly, you'll want a quantized model: the same weights stored at lower numeric precision (e.g. 4-bit integers instead of 16-bit floats), so it's dramatically smaller -- say 80% smaller for a ~0.1% accuracy loss. TheBloke on HuggingFace is one specialist in quantizing popular models. After finding a model you like, you can download some flavor of quantization he made, e.g.: `huggingface-cli download TheBloke/neural-chat-7B-v3-3-GGUF neural-chat-7b-v3-3.Q4_K_M.gguf --local-dir .`; then use your favorite model runner (e.g. llama.cpp) to run it.
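Putting the two steps together (assuming llama.cpp built as above; `Q4_K_M` is just one of several quantization levels published in that repo):

```sh
# Download one quantized flavor (needs the HF CLI: pip install -U "huggingface_hub[cli]")
huggingface-cli download TheBloke/neural-chat-7B-v3-3-GGUF \
    neural-chat-7b-v3-3.Q4_K_M.gguf --local-dir .

# Run it with llama.cpp
./main -m ./neural-chat-7b-v3-3.Q4_K_M.gguf -p "Hello! Introduce yourself." -n 128
```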
Hope that gets you started. Cheers!