Looking at the example where the coffee table is swapped, I notice every time th...

og_kalu · 2025-04-08T13:30:56 1744119056

It's kind of clear that for every request, it generates a new image entirely. Some people are speculating a diffusion decoder but i think it's more likely an implementation of VAR - https://arxiv.org/abs/2404.02905.

So rather than predicting each patch at the target resolution right away, it starts with the image (as patches) at a very small resolution and increasingly scales up. I guess that could make it hard for the model to learn to just copy and paste image tokens for editing like it might for text.

flkiwi · 2025-04-08T13:40:40 1744119640

BUT it's doing a stunningly better job replicating previous scenes than it did before. I asked it just now for a selfie of two biker buddies on a Nevada highway, but one is a quokka and one is a hyrax. It did it. Then I asked for the same photo with late afternoon lighting, and it did a pretty amazing job of preserving the context where just a few months ago it would have had no idea what it had done before.

Also, sweet jesus, after more than a year of hilarious frustration, it now knows that a flying squirrel is a real animal and not just a tree squirrel with butterfly wings.

og_kalu · 2025-04-08T13:47:24 1744120044

I agree. I'm not saying it's a different model generating the images. 4o is clearly generating the images itself rather than sending a prompt to some other model. I'm speculating about the mechanism for generation in the model itself.

flkiwi · 2025-04-08T13:52:40 1744120360

Oh, no, I wasn't taking issue with what you said, just reacting that, yes, it's not editing the same image but redrawing from scratch every time, BUT it's doing a much better job of that, with some understanding of the context of the previous image so that it can tweak it, even if it's never bit for bit identical.

M4v3R · 2025-04-08T11:27:40 1744111660

Yeah, this is in my opinion the biggest limitation of the current gen GPT 4o image generation: it is incapable of editing only parts of an image. I assume what it does every time is tokenizing the source image, then transforming it according to the prompt and then giving you the final result. For some use cases that’s fine but if you really just want a small edit while keeping the rest of the image intact you’re out of luck.

atommclain · 2025-04-08T14:59:38 1744124378

I thought the selection tool allows you to limit the area of the image that a revision will make changes to, but I tested it and I still see changes outside of the selected area which is good to know.

As an example the tape spindles, among other changes, are different: https://chatgpt.com/share/67f53965-9480-800a-a166-a6c1faa87c...

https://help.openai.com/en/articles/9055440-editing-your-ima...

qingcharles · 2025-04-08T16:54:22 1744131262

Yeah, I'm not sure what the selection brush actually does. Is it just a hint to the LLM?

danielbln · 2025-04-08T12:20:16 1744114816

It just means that you comp it together manually. That's still much better than having to set up some inpainting pipeline or whatever.

wavemode · 2025-04-08T14:20:33 1744122033

Is manually comping actually going to be easier (let alone, give better results) than inpainting? I can imagine it working in simple cases, but for anything involving 3D geometry you'll likely run into issues of things not quite lining up between the first and second image.

echelon · 2025-04-08T13:37:10 1744119430

100%. Multimodal images surpass ComfyUI and inpainting (for now). It's a step function improvement in image generation.

I'm hoping we see an open weights or open source model with these capabilities soon, because good tools need open models.

As has happened in the past, once an open implementation of DallE or whatever comes out, the open source community pushes the capabilities much further by writing lots of training, extensions, and pipelines. The results look significantly better than closed SaaS models.

iandanforth · 2025-04-09T02:00:51 1744164051

Fwiw pixlr is a good pairing with GPT 4o for just this. Generate with 4o then use pixlr AI tools to edit bits. Especially for removals pixlr (and I'm sure others) are much much faster and quite reliable.

bla3 · 2025-04-08T10:40:26 1744108826

The pictures on the wall change too.

rob74 · 2025-04-08T11:27:28 1744111648

Actually, almost everything changes slightly - the number, shape and pattern of the chairs, the number and pattern of the pillows, the pattern of the curtains, the scene outside the window, the wooden part of the table, the pattern of the carpet... The blue couch stays largely the same, it just loses some detail...

card_zero · 2025-04-08T10:44:30 1744109070

Yes, first a still life and something impressionist, then a blob and a blob, then a smear and a smear. And what about the reflections and transparency of the glass table top? It gets very indistinct. Keep working at the same image and it looks like you'll end up with some Deep Dream weirdness.

I think the fireplace might be turning into some tiny stairs leading down. :)

YurgenJurgensen · 2025-04-08T10:48:00 1744109280

Only sailors know how to leave.

empath75 · 2025-04-08T14:46:36 1744123596

The vast majority of people wouldn't notice any of that in most contexts in which such an image would be used.