4o still exhibits the "pink elephant effect"; it's just... subtler, and tends to reveal itself on complex or confusing prompts. Negations are also still not handled properly: they tend to slightly confuse the model and decrease the accuracy of the answer or the generated picture. The same is true for any other LLM. Moreover, the author is asking the model to rationalize a decision he has already made ("tell me why there can't be any elephants"), which could work as an equivalent of a CoT step.
It's "just" a much bigger and much better trained model. Which is a quality on its own, absolutely no doubt about that. Fundamentally the issue is still there though, just less prominent. Which kind of makes sense - imagine the prompt "not green", what even is that? It's likely slightly out of distribution and requires representing a more complex abstraction, so the accuracy will necessarily be worse than stating the range of colors directly. The result might be accurate, until the model is confused/misdirected by something else, and suddenly it's not.
I think in the end none of the architectural differences will matter beyond the scaling. What will matter a lot more is data diversity and training quality.
But it's literally a different architecture (auto-regressive, presumably sequence-based, vs. diffusion). In my experiments it is significantly, overwhelmingly better at consistency, coherence, and prompt adherence. Things I needed ControlNets for before, it just... does. And even zooming into fine details, they make sense.
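For context on the architectural distinction being drawn, a deliberately toy sketch (nothing like the real models, just the shape of the two loops): an autoregressive model commits to one token at a time conditioned on the prefix, while a diffusion model starts from noise and refines the whole output in parallel over many denoising steps.

    import numpy as np

    rng = np.random.default_rng(0)

    def autoregressive_generate(steps=8, vocab=16):
        """Toy autoregressive loop: each 'token' is sampled conditioned on the
        prefix (here, crudely, just the last token), and once emitted it is final."""
        tokens = []
        for _ in range(steps):
            logits = rng.normal(size=vocab)   # stand-in for a learned p(x_t | x_<t)
            if tokens:
                logits[tokens[-1]] += 2.0     # crude "conditioning" on the prefix
            probs = np.exp(logits) / np.exp(logits).sum()
            tokens.append(int(rng.choice(vocab, p=probs)))
        return tokens

    def diffusion_generate(steps=8, size=16):
        """Toy diffusion-style loop: start from noise and iteratively nudge the
        *entire* sample toward a target; every step revisits every element."""
        target = np.linspace(0, 1, size)      # stand-in for the learned data manifold
        x = rng.normal(size=size)             # pure noise
        for _ in range(steps):
            x = x + 0.3 * (target - x)        # stand-in for one denoising step
        return x

    print(autoregressive_generate())
    print(np.round(diffusion_generate(), 2))

Whether the consistency and prompt adherence gains actually come from the sequential commitment in the first loop is exactly the open question; this toy only shows the structural difference, not its consequences.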
Of course, it's a tiny specialized model vs. a big generalist model. They're absolutely not comparable in size/quality though, especially the text part. How much of this is due to the poor text encoder and worse training in other models, and how much to the architectural differences? I'm not saying it isn't somehow better than the existing image-gen models, but it's pretty hard to separate the two when both are present. All current SotA LLMs, including 4o itself, show negation inaccuracy in text (you need a really complex prompt, or a long one with thousands of tokens, not a toy one), and I don't see why this one should behave differently under similar conditions. Especially considering that it suffers from pretty much the same artifacts as other image models, just much less (fingers, extra limbs, perspective/lighting issues, overfitting, struggles with out-of-distribution generation, etc.).
I'm not convinced. I tried it and it showed me a swimming hippopotamus, which is even more elephant-like than the turtle. I tried again and it gave me a pelican, which is not generally very elephantish, but this particular one has a gray body with a texture that looks a lot like elephant skin.
It's "just" a much bigger and much better trained model. Which is a quality on its own, absolutely no doubt about that. Fundamentally the issue is still there though, just less prominent. Which kind of makes sense - imagine the prompt "not green", what even is that? It's likely slightly out of distribution and requires representing a more complex abstraction, so the accuracy will necessarily be worse than stating the range of colors directly. The result might be accurate, until the model is confused/misdirected by something else, and suddenly it's not.
I think in the end none of the architectural differences will matter beyond the scaling. What will matter a lot more is data diversity and training quality.