I think you must have misunderstood me. I didn’t say the SD-XL VAE had the same issue as in the OP. What I said was that it didn’t incorporate some of the points that came up during my research:

- Bounding the outputs to [-1, 1] and optimising the variance directly so that it approaches 1

- Increasing the number of channels to 8, since the reduction in spatial resolution is what matters most for latent diffusion

- Using a more modern discriminator architecture instead of PatchGAN’s

- Using a vanilla AE with various perturbations instead of KL divergence
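
To make the first point concrete, here is a minimal sketch of what bounding plus a direct variance objective could look like. It is a toy under stated assumptions, not the actual training code: `tanh` as the bounding function, a squared penalty on the gap between the latent variance and 1, and the names `bound_latents` and `variance_penalty` are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import pvariance


def bound_latents(raw_values):
    """Squash raw encoder outputs into [-1, 1] (tanh is an assumed choice)."""
    return [math.tanh(v) for v in raw_values]


def variance_penalty(latents, target=1.0):
    """Squared gap between the population variance of the latents and the target."""
    return (pvariance(latents) - target) ** 2


random.seed(0)
# Stand-in for flattened encoder outputs; a real model would produce these.
raw = [random.gauss(0.0, 1.0) for _ in range(1024)]

z = bound_latents(raw)
penalty = variance_penalty(z)  # would be added to the reconstruction loss
```

Since the latents are confined to [-1, 1], their variance can never exceed 1, so a penalty like this pushes values out towards the edges of the range rather than letting them collapse around zero.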

Now, SD-XL’s VAE is very good and superior to its predecessor thanks to an improved training procedure, but it didn’t use any of the above tricks. They may even have made no difference in the end; they were useful to me in the context of training models with limited compute.