robbru's comments | Hacker News

Solid snake approved.


Excited to try this out, thanks for sharing.


I've been using the Google Gemma QAT models in 4B, 12B, and 27B with LM Studio with my M1 Max. https://huggingface.co/lmstudio-community/gemma-3-12B-it-qat...
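
If anyone wants to script against these, LM Studio also exposes an OpenAI-compatible server locally. A rough sketch, assuming the server is running on the default port 1234 and the model ID matches whatever you've loaded (both are assumptions, adjust to your setup):

    import requests

    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            # Example model ID -- use whatever LM Studio shows for your loaded model
            "model": "lmstudio-community/gemma-3-12b-it-qat",
            "messages": [{"role": "user", "content": "Explain QAT in one sentence."}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])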


This is the message that got me with 4o! "It won't take long, about 3 minutes. I'll update you when ready."


I think people are sleeping on MLX and directing their attention to the "Apple Intelligence" marketing atm.


Interesting benchmarks, thanks for sharing!

If you're optimizing for lower power draw + higher throughput on Mac (especially in MLX), definitely keep an eye on the Desloth LLMs that are starting to appear.

Desloth models are basically aggressively distilled and QAT-optimized versions of larger instruction models (think: 7B → 1.3B or 2B) designed specifically for high tokens/sec at minimal VRAM. They're tiny but surprisingly capable for structured outputs, fast completions, and lightweight agent pipelines.

I'm seeing Desloth-tier models consistently hit >50 tok/sec on M1/M2 hardware without needing active cooling ramps, especially when combined with low-bit quant like Q4_K_M or Q5_0.

If you care about runtime efficiency per watt + low-latency inference (vs. maximum capability), these newer Desloth-style architectures are going to be a serious unlock.
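
If you want to sanity-check tok/sec numbers like that on your own machine, here's a rough sketch with llama-cpp-python (the model path is a placeholder, point it at whatever Q4_K_M or Q5_0 GGUF you're testing):

    import time
    from llama_cpp import Llama

    # Placeholder path -- point at your own quantized GGUF file
    llm = Llama(model_path="./models/small-model.Q4_K_M.gguf", n_ctx=2048, verbose=False)

    start = time.time()
    out = llm("Explain quantization-aware training in two sentences.", max_tokens=200)
    elapsed = time.time() - start

    n_tokens = out["usage"]["completion_tokens"]
    print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/sec")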


TinyLLM is very cool to see! I will def tinker with it. I've been using MLX format for local LLMs as of late. Kinda amazing to see these models become cheaper and faster. Check out the MLX community on HuggingFace. https://huggingface.co/mlx-community


Great recommendation about the community

Any other resources like that you could share?

Also, what kind of models do you run with mlx and what do you use them for?

Lately I’ve been pretty happy with gemma3:12b for a wide range of things (generating stories, some light coding, image recognition). Sometimes I’ve been surprised by qwen2.5-coder:32b. And I’m really impressed by the speed and versatility, at such tiny size, of qwen2.5:0.5b (playing with fine tuning it to see if I can get it to generate some decent conversations roleplaying as a character)


I've shared a bunch of notes on MLX over the past year, many of them with snippets of code I've used to try out models: https://simonwillison.net/tags/mlx/

I mainly use MLX for LLMs (with https://github.com/ml-explore/mlx-lm and my own https://github.com/simonw/llm-mlx which wraps that), vision LLMs (via https://github.com/Blaizzy/mlx-vlm) and running Whisper (https://github.com/ml-explore/mlx-examples/tree/main/whisper)
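
For anyone who hasn't tried mlx-lm yet, the Python API is only a couple of lines. A minimal sketch (the model ID is just one example from mlx-community):

    from mlx_lm import load, generate

    # Downloads from the Hugging Face mlx-community repo on first run
    model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
    print(generate(model, tokenizer, prompt="Three uses for a local LLM:", max_tokens=200))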

I haven't tried mlx-audio yet (which can synthesize speech) but it looks interesting too: https://github.com/Blaizzy/mlx-audio

The two best people to follow for MLX stuff are Apple's Awni Hannun - https://twitter.com/awnihannun and https://github.com/awni - and community member Prince Canuma who's responsible for both mlx-vlm and mlx-audio: https://twitter.com/Prince_Canuma and https://github.com/Blaizzy


Very cool insight, Simonw! I will check out the audio mlx stuff soon. I think that is kinda new still. Prince Canuma is the GOAT.


Amazing. Thank you for the great resources!


Hey Nico,

Very cool to hear your perspective on how you're using the small LLMs! I’ve been experimenting extensively with local LLM stacks on:

• M1 Max (MLX native)

• LM Studio (GLM, MLX, GGUFs)

• llama.cpp (GGUFs)

• n8n for orchestration + automation (multi-stage LLM workflows)

My emerging use cases:

- Rapid narration scripting

- Roleplay agents with embedded prompt personas

- Reviewing image/video attachments + structuring copy for clarity

- Local RAG and eval pipelines

My current lineup of small LLMs (this changes every month depending on what is updated):

MLX-native models (mlx-community):

-Qwen2.5-VL-7B-Instruct-bf16 → excellent VQA and instruction following

-InternVL3-8B-3bit → fast, memory-light, solid for doc summarization

-GLM-Z1-9B-bf16 → reliable multilingual output + inference density

GGUF via LM Studio / llama.cpp:

-Gemma-3-12B-it-qat → well-aligned, solid for RP dialogue

-Qwen2.5-0.5B-MLX-4bit → blazing fast; chaining 2+ agents at once

-GLM-4-32B-0414-8bit (Cobra4687) → great for iterative copy drafts

Emerging / niche models tested:

MedFound-7B-GGUF → early tests for narrative medicine tasks

X-Ray_Alpha-mlx-8Bit → experimental story/dialogue hybrid

llama-3.2-3B-storyteller-Q4_K_M → small, quick, capable of structured hooks

PersonalityParty_saiga_fp32-i1 → RP grounding experiments (still rough)

I test most new LLMs on release. QAT models in particular are showing promise, balancing speed + fidelity for chained inference. The meta-trend: models are getting better, smaller, faster, especially for edge workflows.

Happy to swap notes if others are mixing MLX, GGUF, and RAG in low-latency pipelines.
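
To make the RAG part concrete, here's a toy sketch of the shape of pipeline I mean: embed a handful of docs, grab the nearest one, and answer with an MLX model (the embedding and generation model names are placeholders, use whatever you run locally):

    import numpy as np
    from sentence_transformers import SentenceTransformer
    from mlx_lm import load, generate

    docs = [
        "MLX is Apple's array framework for machine learning on Apple silicon.",
        "GGUF is the model file format used by llama.cpp.",
        "QAT bakes quantization into training so low-bit models keep more accuracy.",
    ]

    # Placeholder embedding model; any small sentence-embedding model works
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(question):
        q = embedder.encode([question], normalize_embeddings=True)[0]
        # Cosine similarity via dot product of normalized vectors
        return docs[int(np.argmax(doc_vecs @ q))]

    # Placeholder generation model from mlx-community
    model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
    question = "What is GGUF?"
    prompt = f"Context: {retrieve(question)}\n\nQuestion: {question}\nAnswer:"
    print(generate(model, tokenizer, prompt=prompt, max_tokens=100))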


Impressive! Thank you for the amazing notes, I have a lot to learn and test

