Hacker News

Yes: from TFA, SD-XL, released some months ago, uses a new VAE.

N.B. clarifying because most of the top comments are currently recommending this person be hired, or asking whether anyone has begun work to leverage their insights: they're discussing known issues in a two-year-old model as if they were newly discovered issues in a recent model. (TFA points this out as well.)




The SD-XL VAE doesn’t take into account any of those insights; it’s the same architecture as the SD1/2 one, just trained from scratch with a batch size of 256 instead of 9 and with EMA weight tracking.


No. Idk where you got this idea.


From the SD-XL paper:

> To this end, we train the same autoencoder architecture used for the original Stable Diffusion at a larger batch-size (256 vs 9) and additionally track the weights with an exponential moving average. The resulting autoencoder outperforms the original model in all evaluated reconstruction metrics
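For reference, the EMA weight tracking the quote mentions is just a shadow copy of the weights updated as a decaying average after each optimiser step. A minimal sketch (generic technique, not Stability’s exact code):

```python
# Sketch of exponential moving average (EMA) weight tracking:
# keep a shadow copy of the model weights and blend in the live
# weights after every training step. Inference then uses the
# smoother shadow weights instead of the raw ones.
def ema_update(shadow, weights, decay=0.9999):
    """shadow <- decay * shadow + (1 - decay) * weights, elementwise."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

# Toy run with an aggressive decay so the effect is visible:
shadow = [0.0, 0.0]
for _ in range(3):
    shadow = ema_update(shadow, [1.0, 2.0], decay=0.5)
# after 3 steps, shadow has moved most of the way toward [1.0, 2.0]
```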

And if you look at the SD-XL VAE config file, it has a scaling factor of 0.13025 while the original SD VAE had one of 0.18215 - which means it was also trained with an unbounded output. The architecture is also exactly the same if you inspect the model file.
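To make the scaling-factor point concrete, a sketch of my own (not Stability’s code): the factor is conventionally chosen as 1 / std of the encoded latents over a sample of images, so the diffusion model sees roughly unit-variance inputs. Inverting the published factors recovers the implied latent std, which is nowhere near 1 - hence the unbounded-output inference.

```python
# The VAE "scaling factor" normalises latents for the diffusion model:
# factor = 1 / std(latents), so latent * factor has ~unit variance.
import statistics

def scaling_factor(sample_latent_values):
    """1 / empirical std over a sample of encoded latent values."""
    return 1.0 / statistics.pstdev(sample_latent_values)

def to_diffusion_space(latent, factor):
    """Apply the factor before handing latents to the diffusion model."""
    return latent * factor

# Invert the published factors to see the implied raw latent std:
sd1_std  = 1.0 / 0.18215   # original SD VAE
sdxl_std = 1.0 / 0.13025   # SD-XL VAE
# both are far from 1, i.e. neither VAE produced bounded/normalised output
```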

But if you have any details about the training procedure of the new VAE that they didn’t include in the paper, feel free to link to them, I’d love to take a look.


Can someone provide evidence one way or the other? I don’t know enough to do it myself.


Cf. https://news.ycombinator.com/item?id=39220027, or TFA*. They're doing a gish gallop, and I can't really justify burning more karma to poke holes in a stranger's overly erudite tales. I swing about 8 points to the negative when they reply with more.

* multiple sources including OP:

"The SDXL VAE of the same architecture doesn't have this problem,"

"If future models using KL autoencoders do not use the pretrained CompVis checkpoints and use one like SDXL's that is trained properly, they'll be fine."

"SDXL is not subject to this issue because it has its own VAE, which as far as I can tell is trained correctly and does not exhibit the same issues."


I think you must have misunderstood me, I didn’t say the SD-XL VAE had the same issue as in OP. What I said was that it didn’t take into account some of my points that came up during my research:

- Bounding the outputs to [-1, 1] and optimising the variance directly to make it approach 1

- Increasing the number of channels to 8, as the spatial resolution reduction is most important for latent diffusion

- Using a more modern discriminator architecture instead of PatchGAN’s

- Using a vanilla AE with various perturbations instead of KL divergence

Now SD-XL’s VAE is very good and superior to its predecessor, on account of an improved training procedure, but it didn’t use any of the above tricks. It may even be the case that they would have made no difference in the end - they were useful to me in the context of training models with limited compute.
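The first trick above can be sketched as follows. This is my own formulation of the general idea, not SD-XL's (or anyone's) actual training code: squash latents with tanh so they are bounded, and add an auxiliary loss term pushing their variance toward 1, instead of relying on a KL term plus a post-hoc scaling factor.

```python
# Sketch: bounded latents + direct variance optimisation.
import math

def bounded_latent(z):
    """Squash a raw encoder output into (-1, 1)."""
    return math.tanh(z)

def variance_penalty(latents, target=1.0):
    """Auxiliary loss term (Var[z] - target)^2, added to the
    reconstruction loss so latent variance is optimised directly."""
    mean = sum(latents) / len(latents)
    var = sum((z - mean) ** 2 for z in latents) / len(latents)
    return (var - target) ** 2
```

With latents already bounded and near unit variance, no separate scaling factor is needed downstream.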



