Hacker News | new | past | comments | ask | show | jobs | submit | thecopy's comments

Stupid question: can I run this on my 64GB/1TB Mac somehow easily, or does this require custom coding? 4-bit is ~200GB.

EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App


Running larger-than-RAM LLMs is an interesting trick, but it's not practical. The output would be extremely slow and your computer would be burning a lot of power to get there. The heavy quantizations and other tricks (like reducing the number of active experts) used in these demos severely degrade the quality.

With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.


>I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.

There are dynamic quants, such as Unsloth's, which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others, and smaller models are more sensitive to quantization than larger ones. There are also different quantization algorithms, with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella. It all depends.
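To make the bit-width point concrete, here is a rough NumPy sketch of naive symmetric round-to-nearest quantization at different bit widths. This is an illustration of why one extra bit halves the rounding error, not an implementation of any real quant algorithm (K-quants, AWQ, etc. are considerably smarter):

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int) -> np.ndarray:
    """Naive symmetric round-to-nearest quantization, then dequantize."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 15 for 5-bit
    scale = np.abs(w).max() / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                      # reconstructed weights

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

for bits in (4, 5, 8):
    err = np.abs(w - quantize_rtn(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

Each extra bit roughly halves the reconstruction error, which is why the jump from Q4 to Q5 is noticeable on sensitive layers; real quant schemes exploit this by spending more bits only where it matters.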


I should clarify that I'm referring generically to the types of quantizations used in local LLM inference, including those from Unsloth.

Nobody actually quantizes every layer to Q4 in a Q4 quant.


I've tried a number of experiments, and agree completely. If it doesn't fit in RAM, it's so slow as to be impractical and almost useless. If you're running things overnight, then maybe, but expect to wait a very long time for any answers.

Current local-AI frameworks do a bad job of supporting the doesn't-fit-in-RAM case, though. Especially when running combined CPU+GPU inference. If you aren't very careful about how you run these experiments, the framework loads all weights from disk into RAM only for the OS to swap them all out (instead of mmap-ing the weights in from an existing file, or doing something morally equivalent as with the original MacBook Pro experiment) which is quite wasteful!
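The mmap-vs-load distinction can be sketched with NumPy (the file path and sizes here are stand-ins, not anything a real framework uses):

```python
import os
import tempfile
import numpy as np

# Write a "weights" file to disk (stand-in for a model checkpoint).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
np.arange(1_000_000, dtype=np.float32).tofile(path)

# Eager load: the whole file is copied into anonymous RAM up front,
# and under memory pressure the OS must write it out to swap.
eager = np.fromfile(path, dtype=np.float32)

# mmap load: pages are faulted in from disk only as they are touched,
# and the OS can simply drop cold pages, since the file itself is the
# backing store -- no swap writes needed.
lazy = np.memmap(path, dtype=np.float32, mode="r")

# Both views return identical data; only the residency strategy differs.
print(float(eager[123_456]), float(lazy[123_456]))
```

The wasteful pattern described above is effectively the eager branch followed by the OS swapping it all back out, paying for the same bytes twice.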

This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.


Yes, though SSD speed is critical. The repo has macOS builds for CLI and Desktop, but it's early stages. M4 Max gets 10-15 TPS on 400B depending on quantization. Compute is an issue too; a lot of the code is PoC level.

I have a 64G/1T Studio with an M1 Ultra. You can probably run this model to say you’ve done it but it wouldn’t be very practical.

Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my studio for coding tasks and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).

If you decide to give it a go, make sure to use the MLX version over the GGUF one! You’ll get a bit more speed out of it.


Looks interesting, but how do you explore, test, or use it? The product page (https://mistral.ai/products/forge) doesn't contain anything useful either, just "Contact us".

Disappointing.


Shameless plug: I'm working on a product that aims to solve this: https://www.gatana.ai/

Who isn't?

Building Gatana, a platform for securely connecting an organization's agents to their services, with very flexible credential management and federated IDP trust.

Currently my mini-projects include:

* 0% USA dependency; the aim is 100% EU. Currently still using AWS SES for email sending and GCP KMS for envelope encryption of customer data keys.

* Tool output compression, inspired by https://news.ycombinator.com/item?id=47193064. Added semantic search on top of this using a local model running on Hetzner. The next phase is making the entire chain envelope-encrypted.

* "Firewall" for tool calls

* AI Sandboxes ("OpenClaw but secure") with the credential integration mentioned above



I use Ergotron, super happy.


Air power alone has _never_ achieved regime change.


Libya begs to differ


What do you mean? Libya happened two days after France met with Libyan rebel leaders and one of Gaddafi's sons, and the first strike targeted ground installations so that the rebels could take over.

It was carefully planned for a swift takeover, way, way more than what is happening there, and it still ended up being a clusterfuck. The rebels were the fucking ground troops.

Here, it will probably be Iraqis, like during the first Gulf War. Hopefully fewer people will die, but clearly this is a terrible decision.


I implemented this successfully as well. Re structured data: I transformed it from JSON into more "natural language". I also ended up using MiniLM-L6-v2. Will post a GitHub link when I have packaged it independently (it's currently in the main app code; I want to extract it into an independent micro-service).
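The commenter doesn't share their exact transformation, but a minimal sketch of one way to turn JSON into embedding-friendly "natural language" is to recursively flatten key paths into short sentences (the function name and phrasing here are illustrative, not from their code):

```python
import json

def json_to_sentences(obj, path=()):
    """Recursively turn a JSON value into short natural-language lines,
    which tend to embed better than raw key/value syntax."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from json_to_sentences(value, path + (key,))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from json_to_sentences(value, path + (f"item {i + 1}",))
    else:
        # "user name is Alice" rather than {"user": {"name": "Alice"}}
        yield f"{' '.join(path)} is {obj}"

doc = json.loads('{"user": {"name": "Alice", "roles": ["admin", "dev"]}}')
for line in json_to_sentences(doc):
    print(line)
```

This yields lines like "user name is Alice" and "user roles item 1 is admin", which an embedding model like MiniLM-L6-v2 handles more gracefully than nested braces and quotes.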

You wrote:

>A search for “review configuration” matches every JSON file with a review key.

It's a good point; not sure how to de-rank the keys or encode the "commonness" of those words.


IDF handles most of it. In BM25, inverse document frequency naturally down-weights terms that appear in every document, so JSON keys like "id", "status", "type" that show up in every chunk get low IDF scores automatically. The rare, meaningful keys still rank.

For the remaining noise, I chunk the flattened key-paths separately from the values. The key-path goes into a metadata field that BM25 indexes but with lower weight. The value goes into the main content field. So a search for "review configuration" matches on the value side, not because "configuration" appeared as a JSON key in 500 files.

MiniLM-L6-v2 is solid. I went with Model2Vec (potion-base-8M) for the speed tradeoff. 50-500x faster on CPU, 89% of MiniLM quality on MTEB. For a microservice where you're embedding on every request, the latency difference matters more than the quality gap.
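The IDF effect described above is easy to see in a few lines. Here is a sketch of the standard BM25 IDF formula on a toy corpus where "status" appears as a key in every document (the corpus contents are made up for illustration):

```python
import math

def idf(term: str, docs: list[set[str]]) -> float:
    """BM25-style IDF: terms present in every document get ~zero weight."""
    n = sum(term in doc for doc in docs)  # document frequency
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)

docs = [
    {"status", "id", "review", "configuration"},
    {"status", "id", "type"},
    {"status", "id", "user"},
    {"status", "id", "order"},
]

# "status" appears everywhere -> near-zero weight;
# "configuration" appears once -> high weight.
print(f'idf("status")        = {idf("status", docs):.3f}')
print(f'idf("configuration") = {idf("configuration", docs):.3f}')
```

Boilerplate keys sink toward zero automatically, with no manual de-ranking; the separate metadata-field trick then handles the cases where a noisy key is common in the corpus but still matches a rare query term.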


For me, quitting Preview (or maybe it's Settings) resolves it.


Very interesting. One big wrinkle with the OP's approach is exactly that: the structured responses, which many tools return, are untouched. The solution in the OP, as I understand it, is the "execute" method. However, I'm building an MCP gateway, and such sandboxed execution isn't available (...yet), so your approach to this sounds very clever. I'll spend the day trying it out.


The LLM that wrote the comment you are replying to has no idea what it is talking about...


I'm trying it anyway.


Commented below with more in-depth info.


Are you sure it's simply because YOU don't understand it? Because it seems to make sense to me after working on https://github.com/pmarreck/codescan


I'm planning on getting the new M5 MBP I expect to be released next week. Is it possible to downgrade? I assume it comes with Tahoe :(


Typically no; Macs aren't expected to run versions of macOS older than the one they shipped with.


It’s not worth it, especially since the M6 MBP is rumored to already come out later this year (though likely with a price hike): https://9to5mac.com/2026/02/26/two-unique-new-macbook-pros-a...


Depends on what one is looking for. I'm considering upgrading to an M5 model because while the M6 redesign might come with some nicer specs, it's also going to be coming with some teething pains by virtue of having a new design. The M5 generation is probably going to be a speed bump with a chassis and screen that's a known quantity and has had the kinks smoothed out.


I'm also cautious of the redesign. But I got burned by the first Touchbar Macbook back in the day, so twice shy etc.


I skipped the Touch Bar/butterfly era, but before that the GPU/ballgate disaster plagued the MacBook Pro lineup and pushed me away from Apple hardware for over a decade!

My vision has since gotten bad enough that I can't really use laptops/phones well anymore, but I recently got a 15" MacBook Air, M3, pre-Tahoe (for bedtime youtubies).

The hardware is exceptional, battery life even more so... and I'll never update the operating system.



From that page:

> As a rule of thumb, Macs will not run any version of macOS older than the one they shipped with when they launched. Apple provides security updates for older versions of macOS, but it doesn’t bother backporting drivers and other hardware support from newer versions to older ones.

So the answer is “no”, they probably won’t be able to downgrade on the models that are about to be released.


Why not buy a used M4 Pro/Max?


It's possible if you do a wipe and a fresh install. You essentially boot into the Sequoia installer. I'm also looking at possibly picking up an M5 MBP, and this was one of the first things I looked into.


I bought a refurb M4 Mac recently just to avoid Tahoe slop... worth considering, I think.


Almost certainly not :|

