This is absolutely incredible. It takes about 45 seconds to generate a whole image on my iPhone SE (3rd gen), which is about as fast as my M1 Pro MacBook was doing it with the original version!
The SE 3rd Gen has 4 GiB of RAM, so the app defaults to 384x384 output. That's roughly half the computation of a normal 512x512 run. The original version also uses the PLMS sampler, which defaults to 50 steps, while this one uses the newer DPM++ 2M Karras sampler, which defaults to 30 steps. All in all, your M1 Pro MBP still has about 4x the raw performance of your SE 3rd Gen (although my implementation should be faster than PyTorch, at about 2x on M1 chips).
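A rough back-of-envelope check of that comparison, using only the numbers mentioned in this thread and the simplifying assumption that per-step cost scales with pixel count:

```python
# Back-of-envelope estimate of relative work per image.
# Assumption: per-step cost scales roughly with pixel count
# (a simplification; attention scales worse than linearly).

baseline_pixels = 512 * 512   # typical desktop run
phone_pixels = 384 * 384      # SE 3rd Gen default in the app
baseline_steps = 50           # PLMS default in the original version
phone_steps = 30              # DPM++ 2M Karras default here

relative_work = (phone_pixels / baseline_pixels) * (phone_steps / baseline_steps)
print(f"phone run is ~{relative_work:.2f}x the work of the baseline run")
# ~0.34x the work, so similar wall-clock times imply roughly a
# 3-4x gap in raw throughput, consistent with the 4x figure above.
```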
For what it's worth, you can decrease the resolution and use the sampler mentioned above with the PyTorch versions too. The AUTOMATIC web UI supports this, for instance (rough sketch below).
I would still welcome the additional optimizations, however.
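If you'd rather script it than use the web UI, here's a minimal sketch with the Hugging Face diffusers library (not the AUTOMATIC web UI itself); the model ID and prompt are just placeholders, substitute whatever checkpoint you actually use:

```python
# Minimal sketch: lower resolution + DPM++ 2M Karras via diffusers.
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model ID only
    torch_dtype=torch.float16,
).to("cuda")  # or "mps" on Apple Silicon

# DPM++ 2M with Karras sigmas, matching the sampler mentioned above
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

image = pipe(
    "a photo of an astronaut riding a horse",  # placeholder prompt
    height=384, width=384,           # smaller canvas, roughly half the work
    num_inference_steps=30,          # fewer steps than the PLMS default of 50
).images[0]
image.save("out.png")
```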