For reference, I was previously getting just under 3 minutes for 50 iterations on my MacBook Air M1. I haven't yet tried Apple's implementation, but it looks like a huge improvement. It might take it from "possible" to "usable".
Yeah, it's just that the PyTorch MPS backend is not fully baked and has some slowness. You should be able to get close to that number with maple-diffusion (probably 10% slower) or my app: https://drawthings.ai/ (probably around 20% slower, but it supports samplers that take fewer steps (50 -> 30)).
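If anyone wants to reproduce these numbers locally, a minimal sketch using Hugging Face diffusers on the MPS backend looks roughly like this (model id, prompt, and step counts are just examples, not what either app uses):

    import time
    from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

    # Load SD 1.5 and move it to the Metal (MPS) backend.
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("mps")
    pipe.enable_attention_slicing()  # lowers peak memory, helps on 8GB machines

    # Swapping in DPM-Solver++ is the "fewer steps" trick: ~20-30 steps instead of 50.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

    prompt = "a watercolor painting of a lighthouse at dusk"
    _ = pipe(prompt, num_inference_steps=1)  # warm-up; the first MPS pass is much slower

    start = time.perf_counter()
    image = pipe(prompt, num_inference_steps=30).images[0]
    print(f"{time.perf_counter() - start:.1f}s for 30 steps")
    image.save("out.png")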
For comparison, it's also taking ~3 min @ 50 iterations on my 12-core Threadripper using OpenVINO. It sounds like the improvements bring the M1 performance roughly in line with a GTX 1080.
I have a MacBook Air M1, which is passively cooled. When cooled properly (that is, with a thermal pad mod combined with a fan under the laptop), I'm getting closer to 2 min - something like 2.8s per iteration. I'd guess something like 140s for 50 iterations on an M1 MacBook Pro or Mac mini.
How do DreamStudio/Craiyon/Hugging Face manage to do this seemingly quicker on their interfaces? Are they hosting these models on super beefy and costly GPUs for free?
The M1's single-threaded CPU performance and power efficiency are exceptional; however, its GPU performance is nothing special compared to normal discrete GPUs. You don't need something super beefy to beat the M1 on the GPU side.
But also yes, it's gotta be expensive to host these models and I'm not sure where all these subsidies are coming from. I expect that we'll eventually see these things transition to more paid services.
For a low-power SoC, the GPU performance is actually pretty impressive. We recently did some transformer benchmarks and the inference performance of the M1 Max is almost half that of an RTX 3090.
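Not that exact benchmark, but as a rough sketch of how such a comparison can be timed in PyTorch (layer sizes, batch shape, and iteration counts are arbitrary):

    import time
    import torch
    import torch.nn as nn

    # Pick whichever accelerator is present; fall back to CPU.
    device = ("cuda" if torch.cuda.is_available()
              else "mps" if torch.backends.mps.is_available()
              else "cpu")

    def sync():
        # CUDA and MPS both run asynchronously, so sync before reading the clock.
        if device == "cuda":
            torch.cuda.synchronize()
        elif device == "mps" and hasattr(torch, "mps") and hasattr(torch.mps, "synchronize"):
            torch.mps.synchronize()

    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
        num_layers=12,
    ).to(device).eval()
    x = torch.randn(8, 128, 768, device=device)  # 8 sequences of length 128

    with torch.no_grad():
        for _ in range(3):  # warm-up so one-time kernel compilation isn't counted
            model(x)
        sync()
        start = time.perf_counter()
        for _ in range(20):
            model(x)
        sync()
    print(f"{device}: {(time.perf_counter() - start) / 20 * 1000:.1f} ms per forward pass")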
You’re talking about the higher-end SKUs with many more GPU cores though, and significantly more RAM (I think the lowest you can get is 32GB, vs the 8GB on their chip).
That laptop feels like liquid power. It's uncanny.
Macbook Airs (way back when) felt sluggish. The MBA M1 changed that, it was "fine". These M2s are unexpectedly responsive on an ongoing basis.
The MacBook Pro M1 Max is great (would be fantastic except they lost a Thunderbolt port in favor of legacy HDMI and memory card jacks), but you expect that machine to be responsive, so it's less surprising.
The Studio Ultra, though, never slows down for anything.
Still, if the Air could drive two external screens instead of one, I'd "downgrade" from the Max.
I'd give the M1 Air more credit - I moved from a 2019 16" Pro to the Air and performance was nearly identical except for long-running tasks (> 10 minutes). So for mobile app builds, it was blazing fast. And in the meantime the Intel machine was blaring its fans after the first 30 seconds while the Air barely got warm. And then the real kicker was watching the battery on the Intel machine visibly drop a few percentage points, while the Air sat at the same level the whole time.
I've since moved to the M2 Air, and it is noticeably faster than the M1, but it isn't the huge leap from last-gen Intel that the M1 was. But the hardware itself feels way better.
I don't like the lack of open-source drivers, but honestly for work DisplayLink works just fine on macOS. E.g. I used 4 monitors on an M1 Air using DisplayLink:
* Air built-in display
* 2K display connected via USB-C -> DisplayPort adapter
* Two more 2K displays of the same model via DisplayLink, connected through a USB hub
For all practical purposes it's almost impossible to see any DisplayLink compression artifacts, even in most games.
Appreciate this reply, TY for sharing the exact product that's working for you!
Been nervous to dip into it, given the architecture change and last year's challenges with DisplayLink docks.
// UPDATE: Oops, looking at the product, I see I should have specified: 4K screens or higher. About half our desks are 2 x 4K, about half 2 x 5K, except the Air M1 folks who are 1 x 5K.
The true metric includes the output quality of the image, not just the speed. DALL-E's output is, generally, much better for things that aren't standard-looking.
It's less versatile out of the box. Give it a couple months for the community to catch up. Everyone is still figuring out what goes where, and SD 1.x was "everything goes in one spot." It was cool and powerful, but limited.
SD2 wasn’t “neutered”; the piece of it from OpenAI that knew a lot of artist names but wasn’t reproducible was replaced with a new one from Stability that doesn’t. You can fine-tune anything you want back in.
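For example (a hedged sketch, not an official recipe): the usual routes are DreamBooth or textual inversion, and diffusers can load a learned style embedding straight into the 2.x pipeline. The embedding repo and token below are placeholders:

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
    # Hypothetical embedding; substitute one trained on whatever style you want back.
    pipe.load_textual_inversion("some-user/artist-style-embedding", token="<artist-style>")
    pipe = pipe.to("mps")  # or "cuda"

    image = pipe("a castle on a cliff, in the style of <artist-style>").images[0]
    image.save("castle.png")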
The training set was nerfed pretty hard as well; it wasn't just OpenCLIP that was replaced. They will gradually reintroduce more training data over the 2.x releases, I guess.
Yes, they removed some NSFW content, which might've hurt it, but releasing models that can generate CP /will/ get you in legal trouble.
The "in the style of Greg Rutkowski" prompts from SD1, though, were IIRC thought to be proof it was reproducing the training set. But it actually only saw ~27 of his images, and the rest was residual bias from CLIP.
There are different requirements for generating video -- at a minimum, continuity is tough. There are models for producing video, but (as far as I've seen) they're still a bit wobbly.
Video is really a series of frames; film/human viewing can get away with 24 frames/second -- so maybe ~40ms/image for real-time, at least?
What's cool about the era in which we live is that if you look at high-performance graphics for games or simulations, for instance, it may in fact be faster to run a model to "enhance" a low-resolution frame rather than trying to render it fully on the machine.
AMD's approach renders the game at a crummy, low-detail resolution, then "upscales" each frame:
Both FSR and DLSS aim to improve frames-per-second in games by rendering them below your monitor’s native resolution, then upscaling them to make up the difference in sharpness. Currently, FSR uses spatial upscaling, meaning it only applies its upscaling algorithm to one frame at a time. Temporal upscalers, like DLSS, can compare multiple frames at once, to reconstruct a more finely-detailed image that both more closely resembles native res and can better handle motion. DLSS specifically uses the machine learning capabilities of GeForce RTX graphics cards to process all that data in (more or less) real time.
Video is really a series of frames, the framerate for film/human could get away with 24 frames/second-- ~40ms/image for real-time.
What's cool about the era in which we live is if you look at high-performance graphics for games or simulations, it may in fact be faster to run the model on each frame to "enhance" a low-resolution frame rather than trying to render it fully on the machine.
AMD's approach renders the game at a crummy, low-detail resolution, then uses "spatial upscaling" to enhance the images one frame at a time.
NVIDIA DLSS uses "temporal upscaling" to pass over multiple frames and uses other capabilities exclusive to Nvidia's cards to stitch together the frames.
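To make the distinction concrete, a toy sketch in PyTorch (nothing like the real FSR/DLSS implementations): spatial upscaling only looks at the current frame, while temporal upscaling warps the previous output along motion vectors and blends it in.

    import torch
    import torch.nn.functional as F

    low_res   = torch.rand(1, 3, 540, 960)     # current frame, rendered at 960x540
    prev_high = torch.rand(1, 3, 1080, 1920)   # previous upscaled frame (the "history")
    motion    = torch.zeros(1, 1080, 1920, 2)  # per-pixel motion vectors (zero = no movement)

    # Spatial: upscale the single current frame.
    spatial = F.interpolate(low_res, size=(1080, 1920), mode="bilinear", align_corners=False)

    # Temporal: warp the previous high-res frame along the motion vectors,
    # then blend it with the current frame's upscale (history accumulation).
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, 1080),
                            torch.linspace(-1, 1, 1920), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + motion
    warped_prev = F.grid_sample(prev_high, grid, align_corners=False)
    temporal = 0.9 * warped_prev + 0.1 * spatial

DLSS roughly replaces that fixed blend with a learned model that decides, per pixel, how much of the warped history to trust.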
This is a different challenge than generating the content from scratch.
I don't think this is possible in real-time yet, but someone built a filter trained on the German countryside that produces photorealistic Grand Theft Auto driving gameplay:
FSR 2.0 also uses temporal information and movement vectors to upscale, for what it's worth. DLSS 2.0 also renders at a lower resolution and upscales it. DLSS 3.0 frame generation is interesting in that it holds back a frame and generates an extra one in between frame 1 and frame 2, allowing you to boost the perceived frame rate massively, at the cost of some artifacting right now.
You can generate video a lot more efficiently than frame by frame. For example, you can generate every other frame and use something like DLSS 3.0 to fill in the missing ones.
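As a toy illustration of that idea (real frame generation uses optical flow and a learned model, not a plain blend), render only half the frames and synthesize the rest from their neighbors:

    import torch

    def fake_midframe(frame_a: torch.Tensor, frame_b: torch.Tensor) -> torch.Tensor:
        # Naive stand-in for a generated frame; a plain blend ghosts on fast
        # motion, which is roughly the artifacting mentioned above.
        return 0.5 * frame_a + 0.5 * frame_b

    rendered = [torch.rand(3, 1080, 1920) for _ in range(4)]  # the frames actually rendered
    output = []
    for a, b in zip(rendered, rendered[1:]):
        output.extend([a, fake_midframe(a, b)])  # insert a synthetic in-between frame
    output.append(rendered[-1])                  # 4 rendered frames -> 7 displayed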
> For distilled StableDiffusion 2 which requires 1 to 4 iterations instead of 50, the same M2 device should generate an image in <<1 second
https://twitter.com/atiorh/status/1598399408160342039