There is circumstantial evidence out there that 4o image manipulation isn't done within the 4o image generator in one shot, but is a workflow run by an agentic system. Meaning this: the user inputs a prompt like "create an image with no elephants in the room" > the prompt goes to an LLM which preprocesses the human prompt > the LLM outputs a prompt it knows works well with this image generator, e.g. "create an image of a room" > that processed prompt is sent to the image generator. The same happens with edits, only a lot more complicated: function-calling tools are involved, with many layers of edits being done behind the scenes. Try it yourself: take an image, send it in, and have 4o edit it for you in some way, then ask it to edit again, and again, and so on. You will notice a sepia filter being applied on every edit, and the image ends up more and more sepia-toned with each edit. This is because in that workflow the sepia pass is one of the steps that is applied naively, without consideration for the multi-edit case. If this were a one-shot solution where editing is done within the 4o image model by itself, the sepia problem wouldn't be there.
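A minimal sketch of the kind of pipeline being hypothesized here (every function name below is a made-up stand-in, not anything OpenAI has documented), just to show how a fixed post-processing step compounds when every edit round-trips through the same workflow:

    from typing import Optional
    from PIL import Image

    def rewrite_prompt(user_request: str) -> str:
        # Stand-in for the LLM preprocessing step: turn "no elephants in the
        # room" into a prompt the image model handles well.
        return f"photorealistic, well-lit interior, {user_request}"

    def generate_or_edit(prompt: str, source: Optional[Image.Image]) -> Image.Image:
        # Stand-in for the actual image model call (text-to-image or edit).
        return source.copy() if source is not None else Image.new("RGB", (512, 512), "white")

    def apply_house_style(img: Image.Image) -> Image.Image:
        # A naive warm/sepia tint applied unconditionally on every pass.
        r, g, b = img.split()
        r = r.point(lambda v: min(255, int(v * 1.08)))
        b = b.point(lambda v: int(v * 0.92))
        return Image.merge("RGB", (r, g, b))

    img = None
    for request in ["a room with no elephants", "add a window", "make it night"]:
        prompt = rewrite_prompt(request)
        img = generate_or_edit(prompt, img)
        img = apply_house_style(img)  # re-applied on every edit, so the tint accumulates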
As somebody who actually tried to build a multimodal Stable Diffusion chat agent about a year back, using YOLO to build partial masks for adjustments via inpainting, dynamic ControlNets, and a whole host of other things, I highly doubt that it's as simple as an agentic process.
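Roughly the shape of that pipeline, for anyone curious; model names are just examples, and a real version needs error handling for the no-detection case:

    import numpy as np
    import torch
    from PIL import Image
    from ultralytics import YOLO
    from diffusers import StableDiffusionInpaintPipeline

    detector = YOLO("yolov8n-seg.pt")  # segmentation variant, gives per-object masks
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    def edit_object(image_path: str, target_class: str, edit_prompt: str) -> Image.Image:
        image = Image.open(image_path).convert("RGB").resize((512, 512))
        result = detector(image)[0]
        # Union of segmentation masks for every detection of the target class.
        mask = np.zeros((512, 512), dtype=np.uint8)
        for seg, cls_id in zip(result.masks.data, result.boxes.cls):
            if result.names[int(cls_id)] != target_class:
                continue
            m = (seg.cpu().numpy() * 255).astype(np.uint8)
            m = np.array(Image.fromarray(m).resize((512, 512)))
            mask = np.maximum(mask, m)
        # Redraw only the masked region according to the edit prompt.
        return pipe(prompt=edit_prompt, image=image, mask_image=Image.fromarray(mask)).images[0]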
Using the prompt to detect and choose the most appropriate model checkpoint and LoRA(s), along with rewriting the prompt to most appropriately suit the chosen model, has been pretty bog standard for a long time now.
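The usual shape of it looks something like this, with keyword matching standing in for whatever classifier or LLM a given interface actually uses; the checkpoint and LoRA names below are placeholders:

    import torch
    from diffusers import StableDiffusionXLPipeline

    ROUTES = {
        # placeholder checkpoint/LoRA repo names, not real ones
        "anime": {"checkpoint": "some-org/anime-sdxl", "lora": "some-org/anime-style-lora",
                  "prefix": "masterpiece, best quality, "},
        "photo": {"checkpoint": "stabilityai/stable-diffusion-xl-base-1.0", "lora": None,
                  "prefix": "RAW photo, 35mm, sharp focus, "},
    }

    def route(user_prompt: str):
        key = "anime" if any(w in user_prompt.lower() for w in ("anime", "manga", "cel shaded")) else "photo"
        cfg = ROUTES[key]
        pipe = StableDiffusionXLPipeline.from_pretrained(
            cfg["checkpoint"], torch_dtype=torch.float16).to("cuda")
        if cfg["lora"]:
            pipe.load_lora_weights(cfg["lora"])   # LoRA picked from the prompt, not by the user
        return pipe, cfg["prefix"] + user_prompt  # prompt rewritten to suit the chosen model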
> Using the prompt to detect and choose the most appropriate model checkpoint and LoRA(s), along with rewriting the prompt to most appropriately suit the chosen model, has been pretty bog standard for a long time now.
Which players are doing this? I haven't heard of this approach at all.
Most artistic interfaces want you to visually select a style (LoRA, Midjourney sref, etc.) and will load these under the hood. But it's explicit behavior controlled by the user.
None of your observations say anything about how these images are generated one way or another.
The only thing we currently have to go off of is OpenAI's own words, which claim the images are generated by a single multimodal model autoregressively, and I don't think they are lying.
Generated autoregressively and generated in one shot are not the same. There is a possibility that there is a feedback loop here. Personally, I wouldn't be surprised if there was a small one, but not nearly the complex agentic workflow that OP may be thinking of.
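Something like the sketch below, where the critic step may or may not exist at all; both functions are made-up stand-ins rather than anything OpenAI has described:

    def generate_image(prompt: str) -> bytes:
        # Stand-in for the autoregressive image model call.
        raise NotImplementedError

    def critique(image: bytes, request: str) -> tuple[bool, str]:
        # Stand-in for a VLM judging whether the image satisfies the request,
        # returning (ok, hint for a corrected prompt).
        raise NotImplementedError

    def generate_with_feedback(request: str, max_rounds: int = 3) -> bytes:
        prompt = request
        for _ in range(max_rounds):
            image = generate_image(prompt)
            ok, hint = critique(image, request)
            if ok:
                break
            prompt = f"{request}. Correction: {hint}"
        return image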
> This is because in that workflow the sepia pass is one of the steps that is applied naively, without consideration for the multi-edit case. If this were a one-shot solution where editing is done within the 4o image model by itself, the sepia problem wouldn't be there.
I don't really see that with ChatGPT. What I do see is that it's presumably re-running the same basic query each time, with just whatever you said changed, instead of modifying the existing image. Like if you say "generate a photo of a woman", get a pic, and then say "make her hair blonde", the new image is likely to also have different facial features.
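That behaviour is consistent with re-running text-to-image from the amended prompt rather than doing an img2img-style edit on the previous output; with diffusers the difference looks roughly like this (the checkpoint name is just an example):

    import torch
    from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

    t2i = AutoPipelineForText2Image.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
    i2i = AutoPipelineForImage2Image.from_pipe(t2i)  # shares the same weights

    first = t2i("photo of a woman").images[0]

    # (a) fresh generation from the amended prompt: likely a different face entirely
    fresh = t2i("photo of a woman with blonde hair").images[0]

    # (b) editing the previous output at low strength: composition and identity mostly kept
    edited = i2i("photo of a woman with blonde hair", image=first, strength=0.4).images[0]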
The prompt enrichment thing is pretty standard. Everyone does that bit, though some make it user-visible. On Grok it used to show up in the frontend via the download name on the image. The image editing is the interesting part.
All the Stable Diffusion software I've used names the files after some form of the prompt, and that works well probably because SD weights the first tokens higher than the later tokens, likely as a side effect of the way CLIP/BLIP works.
I doubt any of these companies have rolled their own interface to stable diffusion / transformers. It's copy and paste from huggingface all the way down.
I'm still waiting for a confirmed Diffusion Language Model to be released as a GGUF that works with llama.cpp.
Auto1111 and co are using the prompt in the filename because it's convenient, not due to some inherent CLIP mechanism.
If you think that companies like OpenAI (for all the criticisms they deserve) don't use their own inference harness and image models I have a bridge to sell to you.
I give less weight to your opinion than my own. I'm not sure how you misunderstood what I said about CLIP/BLIP, either. I was replying to a comment about "populating the front end with the filename": the first tokens are weighted higher in the resulting image than the later tokens, and therefore, if you prompt correctly, the filenames will be a very accurate description of the image. Especially with danbooru-style prompts, you can just split on spaces and use them as tags, for all practical purposes.
I guess the "convenience" just happened to get ported over from "Auto1111", or it's a coincidence, or...
> There is circumstantial evidence out there that 4o image manipulation isn't done within the 4o image generator in one shot
I thought this was obvious? At least from the first time (and only time) I used it, you can clearly see that it's not just creating one image based on the prompt; instead it first creates a canvas for everything to fit into, then generates piece by piece, with some coordinator deciding the workflow.
Don't think we need evidence either way when it's so obvious from using it and what you can see while it generates the "collage" of images.
I mean, it could very well be that it generates image patches autoregressively, but in a pyramidal way (first a very low resolution version, the "canvas", and then each individual patch). This is very similar to VAR [1].
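A toy version of that coarse-to-fine idea; the `refine` step here is a made-up stand-in for the actual autoregressive model, which in VAR predicts token maps scale by scale:

    import torch
    import torch.nn.functional as F

    def refine(canvas: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Stand-in for the model predicting detail at the current scale,
        # conditioned on the prompt and the upsampled coarser image.
        return canvas

    def pyramidal_generate(cond: torch.Tensor, scales=(8, 16, 32, 64, 256)) -> torch.Tensor:
        img = refine(torch.zeros(1, 3, scales[0], scales[0]), cond)  # the low-res "canvas"
        for s in scales[1:]:
            img = F.interpolate(img, size=(s, s), mode="bilinear", align_corners=False)
            img = refine(img, cond)  # fill in detail at the next scale
        return img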
> This is because in that workflow the sepia pass is one of the steps that is applied naively, without consideration for the multi-edit case.
Unconvinced by that, tbh. This could simply be a bias in the encoder/decoder or the model itself; many image generation models have shown behaviour like this. Also, I'm unsure why a sepia filter would always be applied if it were a workflow: what would be the point of that?
Personally, I don't believe this is just an agentic workflow. Agentic workflows can't really do anything a human couldn't do manually; they just make the process much faster. I spent two years working with image models, specifically around controllability of the output, and there is just no way to get this kind of edit out of a regular diffusion model purely through smarter prompting or other tricks. So I don't see how an agentic workflow would help.
I think you can only get there via a true multimodal model.
Huh, I was thinking myself, based on how it looked, that it was doing layers too.
The blurred backgrounds with sharp cartoon characters in front are what made me think this is how they do it.