I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is ...

reissbaker · 2025-08-04T21:17:29 1754342249

40GB is small IMO: you can run it on a mid-tier Macbook Pro... or the smallest M3 Ultra Mac Studio! You don't need Nvidia if you're doing at-home inference, Nvidia only becomes economical at very high throughput: i.e. dedicated inference companies. Apple Silicon is much more cost effective for single-user for the small-to-medium-sized models. The M3 Ultra is ~roughly on par with a 4090 in terms of memory bandwidth, so it won't be much slower, although it won't match a 5090.

Also for a 20B model, you only really need 20GB of VRAM: FP8 is near-identical to FP16, it's only below FP8 that you start to see dramatic drop-offs in quality. So literally any Mac Studio available for purchase will do, and even a fairly low-end Macbook Pro would work as well. And a 5090 should be able to handle it with room to spare as well.

dur-randir · 2025-08-05T06:10:04 1754374204

Memory bandwidth is only relevant for comparing LLM performance. For image generation, the limiting factor is compute, and Apple sucks with it.

BoredPositron · 2025-08-05T09:17:15 1754385435

If you want to wait 20 minutes for one image you can certainly run it on a macbook pro.

roenxi · 2025-08-05T10:30:17 1754389817

The quality doesn't have to get much higher for that to be a great deal. For humans the wait time is typically measured in days.

BoredPositron · 2025-08-05T11:08:34 1754392114

Tell me you have no experience with generative ai image models nor with human artists.

roenxi · 2025-08-05T12:58:48 1754398728

What experience do you want to point too? I've never seen an artist streaming where they can draw something equivalent to a good piece of AI artwork in 20 minutes. Their advantage right now comes from a higher overall cap on quality of the work. Minute for minute, AIs are much better. It is just that it is pointless giving a typical AI more than a a little time on a GPU because current models can't consistently improve their own work.

jacquesm · 2025-08-06T06:55:01 1754463301

"a good piece of AI artwork"

You really don't understand art. At all.

roenxi · 2025-08-06T11:04:46 1754478286

If you need a hug, I suspect unfortunately I am on the wrong continent. Try thinking some positive thoughts.

RossBencina · 2025-08-04T21:31:31 1754343091

Does M3 Ultra or later have hardware FP8 support on the CPU cores?

reissbaker · 2025-08-04T23:38:48 1754350728

Ah, you're right: it doesn't have dedicated FP8 cores, so you'd get significantly worse performance (a quick Google search implies 5x worse). Although you could still run the model, just slowly.

Any M3 Ultra Mac Studio, or midrange-or-better Macbook Pro, would handle FP16 with no issues though. A 5090 would handle FP8 like a champ and a 4090 could probably squeeze it in as well, although it'd be tight.

slickytail · 2025-08-04T23:12:31 1754349151

All of this only really applies to LLMs though. LLMs are memory bound (due to higher param counts, KV caching, and causal attention) whereas diffusion models are compute bound (because of full self attention that can't be cached). So even if the memory bandwidth of an M3 ultra is close to an Nvidia card, the generation will be much faster on a dedicated GPU.

cma · 2025-08-04T20:15:43 1754338543

If 40GB you can lightly quantize and fit it on a 5090.

AuryGlenz · 2025-08-05T06:50:02 1754376602

Which very few people have, comparatively.

Training it will also be out of reach for most. I’m sure I’ll be able to handle it on my own 5090 at some point but it’ll be slow going.

TacticalCoder · 2025-08-04T20:07:56 1754338076

> I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

40 GB of VRAM? So two GPU with 24 GB each? That's pretty reasonable compared to the kind of machine to run the latest Qwen coder (which btw are close to SOTA: they do also beat proprietary models on several benchmarks).

cellis · 2025-08-04T21:36:26 1754343386

A 3090 + 2xTitanXP? technically i have 48, but i don't think you can "split it" over multiple cards. At least with Flux, it would OOM the Titans and allocate the full 3090

AuryGlenz · 2025-08-05T06:50:34 1754376634

You can’t split image models over 2 GPUs like you can LLMs.

BoredPositron · 2025-08-05T09:18:39 1754385519

They also released an inference server for their models. Wan and qwen-image can be split without problems. https://github.com/modelscope/DiffSynth-Engine

AuryGlenz · 2025-08-06T20:54:09 1754513649

Unless I missed something just from skimming their tutorial it looks like they can do parallelism to speed things up with some models, not actually split the model (apart from the usual chunk offloading techniques).