There is circumstantial evidence out there that 4o image manipulation isn't done within the 4o image generator in one shot, but is a workflow run by an agentic system. Meaning this: the user inputs a prompt like "create an image with no elephants in the room" > the prompt goes to an LLM which preprocesses the human prompt > the LLM outputs a prompt it knows works well with this image generator, e.g. "create an image of a room" > that processed prompt is sent to the image generator. The same happens with edits, only a lot more complicated: function-calling tools are involved, with many layers of edits being done behind the scenes. Try it yourself: take an image, send it in, and have 4o edit it for you in some way, then ask it to edit again, and again, and so on. You will notice a sepia filter being applied on every edit, and the image ends up more and more sepia-toned with each edit. This is because in that workflow the sepia pass is one of the steps that is applied naively, without consideration for the multi-edit case. If this were a one-shot solution where editing is done within the 4o image model by itself, the sepia problem wouldn't be there.
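A minimal sketch of the kind of pipeline being hypothesized here (every function name below is a made-up stand-in, not anything OpenAI has documented), just to show how a fixed post-processing step compounds when every edit round-trips through the same workflow:

    from typing import Optional
    from PIL import Image

    def rewrite_prompt(user_request: str) -> str:
        # Stand-in for the LLM preprocessing step: turn "no elephants in the
        # room" into a prompt the image model handles well.
        return f"photorealistic, well-lit interior, {user_request}"

    def generate_or_edit(prompt: str, source: Optional[Image.Image]) -> Image.Image:
        # Stand-in for the actual image model call (text-to-image or edit).
        return source.copy() if source is not None else Image.new("RGB", (512, 512), "white")

    def apply_house_style(img: Image.Image) -> Image.Image:
        # A naive warm/sepia tint applied unconditionally on every pass.
        r, g, b = img.split()
        r = r.point(lambda v: min(255, int(v * 1.08)))
        b = b.point(lambda v: int(v * 0.92))
        return Image.merge("RGB", (r, g, b))

    img = None
    for request in ["a room with no elephants", "add a window", "make it night"]:
        prompt = rewrite_prompt(request)
        img = generate_or_edit(prompt, img)
        img = apply_house_style(img)  # re-applied on every edit, so the tint accumulates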
As somebody who actually tried to build a multimodal Stable Diffusion chat agent about a year back, using YOLO to build partial masks for adjustments via inpainting, dynamic ControlNets, and a whole host of other things, I highly doubt that it's as simple as an agentic process.
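Roughly the shape of that pipeline, for anyone curious; model names are just examples, and a real version needs error handling for the no-detection case:

    import numpy as np
    import torch
    from PIL import Image
    from ultralytics import YOLO
    from diffusers import StableDiffusionInpaintPipeline

    detector = YOLO("yolov8n-seg.pt")  # segmentation variant, gives per-object masks
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    def edit_object(image_path: str, target_class: str, edit_prompt: str) -> Image.Image:
        image = Image.open(image_path).convert("RGB").resize((512, 512))
        result = detector(image)[0]
        # Union of segmentation masks for every detection of the target class.
        mask = np.zeros((512, 512), dtype=np.uint8)
        for seg, cls_id in zip(result.masks.data, result.boxes.cls):
            if result.names[int(cls_id)] != target_class:
                continue
            m = (seg.cpu().numpy() * 255).astype(np.uint8)
            m = np.array(Image.fromarray(m).resize((512, 512)))
            mask = np.maximum(mask, m)
        # Redraw only the masked region according to the edit prompt.
        return pipe(prompt=edit_prompt, image=image, mask_image=Image.fromarray(mask)).images[0]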
Using the prompt to detect and choose the most appropriate model checkpoint and LoRA(s), along with rewriting the prompt to most appropriately suit the chosen model, has been pretty bog standard for a long time now.
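The usual shape of it looks something like this, with keyword matching standing in for whatever classifier or LLM a given interface actually uses; the checkpoint and LoRA names below are placeholders:

    import torch
    from diffusers import StableDiffusionXLPipeline

    ROUTES = {
        # placeholder checkpoint/LoRA repo names, not real ones
        "anime": {"checkpoint": "some-org/anime-sdxl", "lora": "some-org/anime-style-lora",
                  "prefix": "masterpiece, best quality, "},
        "photo": {"checkpoint": "stabilityai/stable-diffusion-xl-base-1.0", "lora": None,
                  "prefix": "RAW photo, 35mm, sharp focus, "},
    }

    def route(user_prompt: str):
        key = "anime" if any(w in user_prompt.lower() for w in ("anime", "manga", "cel shaded")) else "photo"
        cfg = ROUTES[key]
        pipe = StableDiffusionXLPipeline.from_pretrained(
            cfg["checkpoint"], torch_dtype=torch.float16).to("cuda")
        if cfg["lora"]:
            pipe.load_lora_weights(cfg["lora"])   # LoRA picked from the prompt, not by the user
        return pipe, cfg["prefix"] + user_prompt  # prompt rewritten to suit the chosen model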
> Using the prompt to detect and choose the most appropriate model checkpoint and LoRA(s), along with rewriting the prompt to most appropriately suit the chosen model, has been pretty bog standard for a long time now.
Which players are doing this? I haven't heard of this approach at all.
Most artistic interfaces want you to visually select a style (LoRA, Midjourney sref, etc.) and will load these under the hood. But it's explicit behavior controlled by the user.
None of your observations say anything about how these images are generated one way or another.
The only thing we currently have to go off of is OpenAI's own words, which claim the images are generated by a single multimodal model autoregressively, and I don't think they are lying.
Generated autoregressively and generated in one shot are not the same. There is a possibility that there is a feedback loop here. Personally, I wouldn't be surprised if there was a small one, but not nearly the complex agentic workflow that OP may be thinking of.
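Something like the sketch below, where the critic step may or may not exist at all; both functions are made-up stand-ins rather than anything OpenAI has described:

    def generate_image(prompt: str) -> bytes:
        # Stand-in for the autoregressive image model call.
        raise NotImplementedError

    def critique(image: bytes, request: str) -> tuple[bool, str]:
        # Stand-in for a VLM judging whether the image satisfies the request,
        # returning (ok, hint for a corrected prompt).
        raise NotImplementedError

    def generate_with_feedback(request: str, max_rounds: int = 3) -> bytes:
        prompt = request
        for _ in range(max_rounds):
            image = generate_image(prompt)
            ok, hint = critique(image, request)
            if ok:
                break
            prompt = f"{request}. Correction: {hint}"
        return image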
> This is because in that workflow the sepia pass is one of the steps that is applied naively, without consideration for the multi-edit case. If this were a one-shot solution where editing is done within the 4o image model by itself, the sepia problem wouldn't be there.
I don't really see that with ChatGPT. What I do see is that it's presumably re-running the same basic query each time, with just whatever you said changed, instead of modifying the existing image. Like if you say "generate a photo of a woman", get a pic, and then say "make her hair blonde", the new image is likely to also have different facial features.
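That behaviour is consistent with re-running text-to-image from the amended prompt rather than doing an img2img-style edit on the previous output; with diffusers the difference looks roughly like this (the checkpoint name is just an example):

    import torch
    from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

    t2i = AutoPipelineForText2Image.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
    i2i = AutoPipelineForImage2Image.from_pipe(t2i)  # shares the same weights

    first = t2i("photo of a woman").images[0]

    # (a) fresh generation from the amended prompt: likely a different face entirely
    fresh = t2i("photo of a woman with blonde hair").images[0]

    # (b) editing the previous output at low strength: composition and identity mostly kept
    edited = i2i("photo of a woman with blonde hair", image=first, strength=0.4).images[0]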
The prompt enrichment thing is pretty standard. Everyone does that bit, though some make it user-visible. On Grok it used to show up in the frontend via the download name on the image. The image editing is the interesting part.
All the Stable Diffusion software I've used names the files after some form of the prompt, and that works well probably because SD weights the first tokens higher than the later tokens, likely as a side effect of the way CLIP/BLIP works.
I doubt any of these companies have rolled their own interface to stable diffusion / transformers. It's copy and paste from huggingface all the way down.
I'm still waiting for a confirmed Diffusion Language Model to be released as a GGUF that works with llama.cpp.
Auto1111 and co are using the prompt in the filename because it's convenient, not due to some inherent CLIP mechanism.
If you think that companies like OpenAI (for all the criticisms they deserve) don't use their own inference harness and image models I have a bridge to sell to you.
I give less weight to your opinion than my own. I'm not sure how you misunderstood what I said about CLIP/BLIP, either. I was replying to a comment about "populating the front end with the filename": the first tokens are weighted higher in the resulting image than the later tokens, and therefore, if you prompt correctly, the filenames will be a very accurate description of the image. Especially with danbooru-style prompts, you can just split on spaces and use them as tags, for all practical purposes.
I guess the "convenience" just happened to get ported over from "Auto1111", or it's a coincidence, or...
> There is circumstantial evidence out there that 4o image manipulation isn't done within the 4o image generator in one shot
I thought this was obvious? At least from the first time (and only time) I used it, you can clearly see that it's not just creating one image based on the prompt; instead it first creates a canvas for everything to fit into, then generates piece by piece, with some coordinator deciding the workflow.
Don't think we need evidence either way when it's so obvious from using it and what you can see while it generates the "collage" of images.
I mean, it could very well be that it generates image patches autoregressively, but in a pyramidal way (first a very low resolution version, the "canvas", and then each individual patch). This is very similar to VAR [1].
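A toy version of that coarse-to-fine idea; the `refine` step here is a made-up stand-in for the actual autoregressive model, which in VAR predicts token maps scale by scale:

    import torch
    import torch.nn.functional as F

    def refine(canvas: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Stand-in for the model predicting detail at the current scale,
        # conditioned on the prompt and the upsampled coarser image.
        return canvas

    def pyramidal_generate(cond: torch.Tensor, scales=(8, 16, 32, 64, 256)) -> torch.Tensor:
        img = refine(torch.zeros(1, 3, scales[0], scales[0]), cond)  # the low-res "canvas"
        for s in scales[1:]:
            img = F.interpolate(img, size=(s, s), mode="bilinear", align_corners=False)
            img = refine(img, cond)  # fill in detail at the next scale
        return img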
> This is because in that workflow the sepia pass is one of the steps that is applied naively, without consideration for the multi-edit case.
Unconvinced by that, tbh. This could simply be a bias in the encoder/decoder or the model itself; many image generation models have shown behaviour like this. Also, I'm unsure why a sepia filter would always be applied if it were a workflow: what would be the point of that?
Personally, I don't believe this is just an agentic workflow. Agentic workflows can't really do anything a human couldn't do manually; they just make the process much faster. I spent two years working with image models, specifically around controllability of the output, and there is just no way to get this kind of edit out of a regular diffusion model purely through smarter prompting or other tricks. So I don't see how an agentic workflow would help.
I think you can only get there via a true multimodal model.
Huh, I was thinking myself, based on how it looked, that it was doing layers too.
The blurred backgrounds with sharp cartoon characters in front are what made me think this is how they do it.