Someone on Reddit did promptless img2img in Comfy by passing an image through VAE Encode and then through the schnell model as a kind of refiner, with great results.
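For anyone who wants to try the same idea outside Comfy, here's a minimal sketch using the diffusers `FluxImg2ImgPipeline`, with an empty prompt standing in for "promptless". The strength and step values are my own guesses, not the original poster's workflow.

```python
# Rough sketch of the encode -> denoise-with-schnell -> decode idea via diffusers.
# Not the Reddit poster's Comfy graph; values here are illustrative assumptions.
import torch
from diffusers import FluxImg2ImgPipeline
from diffusers.utils import load_image

pipe = FluxImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

init_image = load_image("input.png")  # image to refine

out = pipe(
    prompt="",              # empty prompt approximates "promptless"
    image=init_image,
    strength=0.3,           # low strength keeps it acting like a refiner (assumption)
    num_inference_steps=4,  # schnell is tuned for very few steps
    guidance_scale=0.0,     # schnell doesn't need CFG
).images[0]
out.save("refined.png")
```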
To maximize quality we skipped LoRA and did a full fine-tune on 32 A100-80GB GPUs with a sequence length of 4096. A native fine-tune is possible on as few as 8 A100-80GB GPUs with DeepSpeed ZeRO-3, but it will take longer.
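For reference, a minimal sketch of what the 8x A100 ZeRO-3 path could look like. The config values below (offload settings, batch sizes) are illustrative assumptions, not the exact settings from our 32x A100 run.

```python
# Illustrative DeepSpeed ZeRO-3 config for a native fine-tune on fewer GPUs.
# All values are placeholders, not the settings we actually used.
import json

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # shard params, grads, and optimizer state
        "offload_optimizer": {"device": "cpu"},  # trade speed for memory headroom
        "offload_param": {"device": "none"},
        "overlap_comm": True,
    },
    "gradient_accumulation_steps": 4,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_clipping": 1.0,
}

with open("ds_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Then launch with something like:
#   deepspeed --num_gpus 8 train.py --deepspeed ds_zero3.json
```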
With LoRA you can probably get away with just a few 4090s.
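A minimal sketch of the LoRA route with PEFT, assuming you're wrapping the transformer's attention projections; the rank and target module names are guesses and depend on the model implementation.

```python
# Sketch of the LoRA alternative with PEFT; module names are assumptions
# and depend on how the transformer is implemented.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,             # low rank keeps trainable params (and VRAM) small
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections (assumed names)
)

# `transformer` is the loaded diffusion transformer you want to adapt.
model = get_peft_model(transformer, lora_config)
model.print_trainable_parameters()  # should be a small fraction of the full model
```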