
Atila from Apple on the expected performance:

> For distilled StableDiffusion 2 which requires 1 to 4 iterations instead of 50, the same M2 device should generate an image in <<1 second

https://twitter.com/atiorh/status/1598399408160342039



With the full 50 iterations it appears to be about 30s on M1.

They have some benchmarks on the github repo: https://github.com/apple/ml-stable-diffusion

For reference, I was previously getting a bit under 3 minutes for 50 iterations on my MacBook Air M1. I haven't yet tried Apple's implementation, but it looks like a huge improvement. It might take it from "possible" to "usable".
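
For anyone curious what that baseline looks like: a minimal sketch of the PyTorch-MPS path via Hugging Face diffusers (not Apple's Core ML pipeline; the model ID and settings here are just illustrative):

    # Minimal sketch: Stable Diffusion via the PyTorch MPS backend with
    # Hugging Face diffusers; roughly the "few minutes per image" baseline,
    # not Apple's new Core ML pipeline. Model ID and settings are illustrative.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("mps")            # Apple-silicon GPU via Metal Performance Shaders
    pipe.enable_attention_slicing()  # keeps memory manageable on 8-16GB machines

    # One-step warm-up; the first MPS run pays a one-time compilation cost.
    _ = pipe("warmup", num_inference_steps=1)

    image = pipe("a photo of an astronaut riding a horse on mars",
                 num_inference_steps=50).images[0]
    image.save("astronaut.png")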


Yeah, it's just that the PyTorch MPS backend is not fully baked and has some slowness. You should be able to get close to that number with maple-diffusion (probably 10% slower) or my app: https://drawthings.ai/ (probably around 20% slower, but it supports samplers that take fewer steps (50 -> 30)).
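
That "fewer steps" trick isn't app-specific, by the way; in diffusers it's just a scheduler swap (rough sketch, class names per recent diffusers versions, not how Draw Things is implemented internally):

    # Sketch: trade steps for a better sampler. DPM-Solver++ gives comparable
    # quality at ~25-30 steps vs ~50 with the default scheduler.
    from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe = pipe.to("mps")

    image = pipe("a watercolor fox in a forest", num_inference_steps=25).images[0]
    image.save("fox.png")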


For comparison, it's also taking ~3 min @ 50 iterations on my 12-core Threadripper using OpenVINO. It sounds like the improvements bring the M1 performance roughly in line with a GTX 1080.


The Apple Neural Engine in the M1 is supposed to be able to perform 11 TOPS. The GTX 1080 does about 9-11 TFLOPS.

So it sounds plausible that the M1 can reach the same level in some use cases with the right optimizations.


I have a MacBook Air M1, which is passively cooled. When cooled properly (a thermal pad mod combined with a fan under the laptop), I'm getting closer to 2 minutes, something like 2.8s per iteration. I'd guess something like 140s for 50 iterations on an M1 MacBook Pro or Mac mini.


This is accurate re: M1 Mac Mini times IME


Not SD 2.0 but SD 1.5: I'm getting 30 iterations in 10 seconds on a 1080 Ti, and 50 iterations in 18 seconds.

    100%|██████████| 30/30 [00:10<00:00, 2.84it/s]


How do DreamStudio/Craiyon/Hugging Face manage to be seemingly quicker on their interfaces? Are they hosting these models on super beefy and costly GPUs for free?


M1's single-threaded CPU performance and power efficiency are exceptional; however, the M1's GPU performance is nothing special compared to normal discrete GPUs. You don't need something super beefy to beat the M1 on the GPU side.

But also yes, it's gotta be expensive to host these models and I'm not sure where all these subsidies are coming from. I expect that we'll eventually see these things transition to more paid services.


For a low-power SoC, the GPU performance is actually pretty impressive. We recently did some transformer benchmarks, and the inference performance of the M1 Max is almost half that of an RTX 3090:

https://explosion.ai/blog/metal-performance-shaders

However, the SoC only uses 31W when posting that performance.
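
If you want a rough feel for the GPU-vs-CPU gap on your own machine, a crude matmul timing works (this is not the transformer benchmark from the post, just an illustration):

    # Crude MPS vs CPU comparison; not the transformer benchmark from the
    # blog post, just a quick way to see what the Metal backend buys you.
    import time
    import torch

    def avg_matmul_seconds(device, n=4096, reps=10):
        x = torch.randn(n, n, device=device)
        (x @ x).sum().item()          # warm-up; .item() also forces a sync on MPS
        start = time.time()
        for _ in range(reps):
            (x @ x).sum().item()
        return (time.time() - start) / reps

    print("cpu:", avg_matmul_seconds("cpu"))
    if torch.backends.mps.is_available():
        print("mps:", avg_matmul_seconds("mps"))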


Haven't tried this yet, but it sounds slower than SD itself if you use one of the alt builds that supports MPS instead of CUDA.

Mac Studio with M1 Ultra gets 3.3 iters/sec for me.

MacBook Pro M1 Max gets 2.8 iters/sec for me.


You’re talking about the higher-end SKUs though, with many more GPU cores and significantly more RAM (I think the lowest you can get is 32GB vs the 8GB on their chip).


If you told me this was possible when I bought an M1 Pro less than a year ago, I wouldn’t believe you. This is insane.


Agreed.

And the posted benchmarks for the M2 MacBook Air make me consider 'upgrading' to an Air.


That laptop feels like liquid power. It's uncanny.

MacBook Airs (way back when) felt sluggish. The M1 MBA changed that; it was "fine". These M2s are unexpectedly responsive on an ongoing basis.

The MacBook Pro M1 Max is great (would be fantastic except they lost a Thunderbolt port in favor of legacy HDMI and memory card jacks), but you expect that machine to be responsive, so it's less surprising.

The Studio Ultra, though, never slows down for anything.

Still, if the Air could drive two external screens instead of one, I'd "downgrade" from the Max.


I'd give the M1 Air more credit - I moved from a 2019 16" Pro to the Air and performance was nearly identical except for long-running tasks (> 10 minutes). So for mobile app builds, it was blazing fast. And in the meantime the Intel machine was blaring fans after the first 30 seconds while the Air barely got warm. And then the real kicker was watching the battery on the Intel machine visibly drop a few percentage points while the Air sat at the same level the whole time.

I've since moved to the M2 Air, and it is noticeably faster than the M1, but it isn't the huge leap from last-gen Intel that the M1 was. But the hardware itself feels way better.


I don't like the lack of open-source drivers, but honestly for work DisplayLink works just fine on macOS. E.g. I used 4 monitors on an M1 Air using DisplayLink:

* Air built-in display

* 2K display connected via USB-C -> DisplayPort adapter

* Two more 2K displays of same model via DisplayLink connected via USB hub

For all practical purposes, it's almost impossible to see any DisplayLink compression artifacts, even in most games.

PS: Each adapter cost me $40:

https://www.amazon.com/gp/product/B08HN2X88P/


Appreciate this reply, TY for sharing the exact product that's working for you!

Been nervous to dip into it, given the architecture change and last year's challenges with DisplayLink docks.

// UPDATE: Oops, looking at the product, I see I should have specified: 4K screens or higher. About half our desks are 2 x 4K, about half 2 x 5K, except the Air M1 folks who are 1 x 5K.


Sadly, I can only report it working at 2560x1440, even though a lower resolution is specified on Amazon.

For higher resolutions, some other solution is required.


Last nail in the coffin for DALL·E.


Not really; everyone will have their own flavor of how to rapidly train the model.

DALL-E et al. will still be able to bandwagon off of all the free ecosystem being built around the $10M SD 1.4 model that is showing what is possible.

E.g. DALL-E could go straight to Hollywood if their model training works better than SD's. The toolsets will work.


Source for the $10M number? I haven't heard that one before; everyone just keeps parroting the $600K single-run number, which is obviously misleading.


yeah, finally we see the real openAI


more open than open source, it's the open model age


The true metric includes the output quality of the image, not just the speed. DALL-E output is, generally, much better for things that aren't standard-looking.


If that's the metric, MidJourney --v 4 --q 2 is the leader, and it's not close.


I think they can move upmarket just as well as anyone else.


SD2 is the one that was neutered, right?

Maybe a dumb question but can the old model still be run?


It's less versatile out of the box. Give it a couple months for the community to catch up. Everyone is still figuring out what goes where, and SD 1.x was "everything goes in one spot." It was cool and powerful, but limited.


You can still do nice things with SD2, it just requires a different approach. https://news.ycombinator.com/item?id=33780543


Also, can you not "upgrade" but still run new models?


You can do anything you want.

SD2 wasn’t “neutered”; the piece of it from OpenAI that knew a lot of artist names but wasn’t reproducible was replaced with a new one from Stability that doesn’t. You can fine-tune anything you want back in.
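
For instance, with diffusers you can bolt a learned concept back on via textual inversion; a sketch, where the embedding path and token are placeholders (you'd need an embedding actually trained against SD2's text encoder, and a reasonably recent diffusers):

    # Sketch: loading a textual-inversion embedding back into SD 2.x.
    # "./my-sd2-artist-embedding" and "<my-artist>" are placeholders; the
    # embedding must have been trained against SD2's OpenCLIP text encoder.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
    pipe.load_textual_inversion("./my-sd2-artist-embedding", token="<my-artist>")
    pipe = pipe.to("mps")

    image = pipe("a castle at dusk in the style of <my-artist>",
                 num_inference_steps=30).images[0]
    image.save("castle.png")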


The training set was nerfed pretty hard as well; it wasn't just the switch to OpenCLIP. They will successively re-admit more training data during the 2.x releases, I guess.


Yes, they removed some NSFW content, which might've hurt it, but releasing models that can generate CP /will/ get you in legal trouble.

The "in the style of Greg Rutkowski" prompts from SD1, though, IIRC, were thought to be proof it was reproducing the training set. But it actually only saw ~27 images of his, and the rest was residual bias from CLIP.


Note that this is extrapolation for the distilled model, which isn't quite released yet (but it will be very exciting when it is!).


I'm very ignorant here, so forgive me, but if it can generate images that fast, can it be used to generate video?


There are different requirements for generating video -- at a minimum, continuity is tough. There are models for producing video, but (as far as I've seen) they're still a bit wobbly.


Yeah, sure. The issue is with temporal consistency. Meta and Google have some successes in that area.

https://mezha.media/en/2022/10/06/google-is-working-on-image...

Give it some time and SD will be able to do the same.


They already do, with varying levels of performance and success.

See deforum[1] and andreasjansson's stable-diffusion-animation[2].

[1]: https://deforum.github.io/

[2]: https://replicate.com/andreasjansson/stable-diffusion-animat...


Video is really just a series of frames; film gets away with 24 frames/second, so you'd need roughly ~40ms/image for real time.

What's cool about the era in which we live is that, if you look at high-performance graphics for games or simulations, it may in fact be faster to run a model on each frame to "enhance" a low-resolution render than to render it fully on the machine.

ex. AMD's FSR vs NVIDIA DLSS

- AMD FSR (FidelityFX Super Resolution): https://www.amd.com/en/technologies/fidelityfx-super-resolut...

- NVIDIA DLSS (Deep Learning Super Sampling): https://www.nvidia.com/en-us/geforce/technologies/dlss/

AMD's approach renders the game at a crummy, low-detail resolution and then applies "spatial upscaling" to enhance the images one frame at a time.

NVIDIA DLSS uses "temporal upscaling": it passes over multiple frames and uses machine-learning capabilities exclusive to Nvidia's RTX cards to stitch the frames together.

This is a different challenge than generating the content from scratch.

I don't think this is possible in real time yet, but someone applied a filter trained on German countryside footage to produce photorealistic Grand Theft Auto driving gameplay:

https://www.youtube.com/watch?v=P1IcaBn3ej0

Notice the mountains in the background go from Southern California brown to lush green.

https://www.rockpapershotgun.com/amd-fsr-20-is-a-more-demand....
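
As a toy version of the "enhance each frame" idea, here's roughly what pushing a single frame through SD's img2img pipeline looks like (file names, prompt, and settings are made up; this runs at seconds per frame, nowhere near real time):

    # Toy "enhance each frame": one video/game frame through SD img2img.
    # Frame path, prompt, and settings are illustrative only.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("mps")

    frame = Image.open("frame_0001.png").convert("RGB").resize((512, 512))
    styled = pipe(
        prompt="photorealistic countryside, overcast, dashcam photo",
        image=frame,
        strength=0.35,          # low strength keeps the original frame's structure
        num_inference_steps=30,
        generator=torch.Generator("cpu").manual_seed(42),  # fixed seed helps consistency across frames
    ).images[0]
    styled.save("frame_0001_styled.png")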


FSR 2.0 also uses temporal information and movement vectors to upscale, for what it's worth. DLSS 2.0 also renders at a lower resolution and upscales it. DLSS 3.0 frame generation is interesting, in that it holds "back" a frame and generates an extra one in between frame 1 and frame 2, allowing you to boost perceived frame rate massively, at the cost of some artifacting right now.


You can generate video a lot more efficiently than frame by frame. For example, you can generate every other frame and use something like DLSS 3.0 to fill in the missing ones.
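
A crude illustration of the "fill in the missing frames" half (a plain pixel blend between two generated frames; real frame generation like DLSS 3 uses motion vectors and is far smarter):

    # Naive in-between frame: a 50/50 pixel blend of two generated frames
    # (they must be the same size). Only meant to show the
    # generate-half-the-frames, interpolate-the-rest idea.
    from PIL import Image

    frame_a = Image.open("frame_0001.png").convert("RGB")
    frame_b = Image.open("frame_0003.png").convert("RGB")
    in_between = Image.blend(frame_a, frame_b, alpha=0.5)
    in_between.save("frame_0002_interpolated.png")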



