More

yousif_123123 · 2025-05-06T18:48:23 1746557303

The opposite problem is also true. I was using it to edit code I had that was calling the new openai image API, which is slightly different from the dalle API. But Gemini was consistently "fixing" the OpenAI call even when I explained clearly not to do that since I'm using a new API design etc. Claude wasn't having that issue.

The models are very impressive. But issues like these still make me feel they are still more pattern matching (although there's also some magic, don't get me wrong) but not fully reasoning over everything correctly like you'd expect of a typical human reasoner.

disgruntledphd2 · 2025-05-06T19:00:42 1746558042

They are definitely pattern matching. Like, that's how we train them, and no matter how many layers of post training you add, you won't get too far from next token prediction.

And that's fine and useful.

mdp2021 · 2025-05-06T20:14:25 1746562465

> fine and useful

And crippled, incomplete, and deceiving, dangerous.

astrange · 2025-05-06T22:58:50 1746572330

That's normal for any professional tool, but it's not normal to be so upset about it. A saw will take your finger off, but you still want to use it for woodworking.

mdp2021 · 2025-05-06T23:26:23 1746573983

> A saw

No: that in context is a plaster cast saw that looks vibrational but is instead a rotational saw for wood, and you will tend to believe it has safety features it was really not engineered with.

For plaster casts you have to have to plan, design and engineer a proper apt saw - learn what you must from the experience of saws for wood, but it's a specific project.

toomuchtodo · 2025-05-06T18:56:42 1746557802

It seems like the fix is straightforward (check the output against a machine readable spec before providing it to the user), but perhaps I am a rube. This is no different than me clicking through a search result to the underlying page to verify the veracity of the search result surfaced.

disgruntledphd2 · 2025-05-06T19:02:08 1746558128

Why coding agents et al don't make use of the AST through LSP is a question I've been asking myself since the first release of GitHub copilot.

I assume that it's trickier than it seems as it hasn't happened yet.

xmcqdpt2 · 2025-05-07T11:11:52 1746616312

My guess is that it doesn’t work for several reasons.

While we have millions of LOCs to train models on, we don’t have that for ASTs. Also, except for LISP and some macro supporting languages, the AST is not usually stable at all (it’s an internal implementation detail). It’s also way too sparse because you need a pile of tokens for even simple operations. The Scala AST for 1 + 2 for example probably looks like this,

Apply(Select(scala, Select(math, Select(Int, Select(+)))), New(Literal(1)), Seq(This, New(Literal(2))) etc etc

which is way more tokens than 1 + 2. You could possibly use a token per AST operation but then you can’t train on human language anymore and you need a new LLM per PL, and you can’t solve problem X in language Y based on a solution from language Z.

disgruntledphd2 · 2025-05-08T14:23:14 1746714194

> While we have millions of LOCs to train models on, we don’t have that for ASTs

Agreed, but that could be generated if it made a big difference.

I do completely take your points around the instability of the AST and the length, those are important facets to this question.

However, what I (and probably others) want is something much, much simpler. Merely (I love not having to implement this so I can use this word ;) ) check the code with the completion done (so what the AI proposes) and weight down completions that increase the number of issues found from the type-checking/linting/lsp process.

Honestly, just killing the ones that don't parse properly would be very helpful (I've noticed that both Copilot and the DBX completers are particularly bad at this one).

Quiark · 2025-05-13T02:36:49 1747103809

Microsoft is working on it

celeritascelery · 2025-05-06T19:25:25 1746559525

What good do you think that would do?

disgruntledphd2 · 2025-05-07T06:07:17 1746598037

I've gotten a bunch of unbalanced parentheses suggestions, as well as loads of non existent variables generated.

One could use the LSP errors to remove those completions.

yousif_123123 · 2025-04-24T23:14:57 1745536497

It's also good but clearly not close still. Maybe Gemini 2.5 or 3 will have better image gen.

yousif_123123 · 2025-04-24T23:13:27 1745536407

It's a mix of both it feels to me as I've been testing it. For example, you can't get it to make a clock showing custom time like 3:30, or someone writing with their left hand.. And it can't do follow many instructions or do them very precisely. But it shows that this kind of architecture will be be capable of that if scaled up most likely.

jumploops · 2025-04-25T00:33:11 1745541191

These are great tests, thanks for sharing!

And you seem to be right, though the only reference I can find is in one of the example images of a whiteboard posted on the announcement[0].

It shows: tokens -> [transformer] -> [diffusion] pixels

hjups22 on Reddit[1] describes it as:

> It's a hybrid model. The AR component generates control embeddings that then get decoded by a diffusion model. But the control embeddings are accurate enough to edit and reconstruct the images surprisingly well.

[0]https://openai.com/index/introducing-4o-image-generation/

[1]https://www.reddit.com/r/MachineLearning/comments/1jkt42w/co...

yousif_123123 · 2025-04-25T17:28:05 1745602085

Yes. Also, when testing low vs high, it seems the difference is mainly in the diffusion part, as the structure of the image and the instruction following ability is usually the same.

Still, very exciting and for the future as well. It's still pretty expensive and slow. But moving in the right direction.

yousif_123123 · 2025-04-18T12:15:46 1744978546

I have noticed Gemini not accepting an instruction to "leave all other code the same but just modify this part" on a code that included use of an alpha API with a different interface than what Gemini knows is the correct current API. No matter how I promoted 2.5 pro, I couldn't get it to respect my use of the alpha API, it would just think I must be wrong.

So I think patterns from the training data are still overriding some actual logic/intelligence in the model. Or the Google assistant fine-tuning is messing it up.

Workaccount2 · 2025-04-18T13:57:23 1744984643

I have been using gemini daily for coding for the last week, and I swear that they are pulling levers and A/B testing in the background. Which is a very google thing to do. They did the same thing with assistant, which I was a pretty heavy user of back in the day (I was driving a lot).

yousif_123123 · 2025-04-15T16:35:16 1744734916

I do like the vinyl and analog amplifiers. I certainly hear the warmth in this case.

yousif_123123 · 2025-03-21T14:00:25 1742565625

Well I think there's many cheaper models in terms of bang for buck currently per token and intelligence than gpt4o. Other than OpenAI having very high rate limits and throughout available without a contract done with sales, I don't see much reason to use it currently instead of sonnet 3.5 or 3.7, or Google's Flash 2.0

Perhaps their training cost and their current inference cost is higher, but what you get as a customer is a more expensive product for what it is, IMO.

yousif_123123 · 2025-02-28T00:15:21 1740701721

It's disappointing not to see comparisons to Sonnet 3.7. Also since o3-mini is ahead of o1, not sure why in the video they compared to o1.

gpt4 was way ahead of 3.5 when it came out. It's unfortunate that the first major gpt release since that is so underwhelming..

j_bum · 2025-02-28T05:55:41 1740722141

Agreed, but I suppose this is a tell. I think they’re trying to place this into a separate class of models.

I.e., we know it might not be as good as 3.7, but it is very friendly and maybe acts like it knows more things.

yousif_123123 · 2025-01-01T15:23:12 1735744992

It does have it. If you tap on it, it should show it.

yousif_123123 · on June 30, 2024

Haven't read the paper yet, but looks like this can improve the ability of the model attention to work better, since many of these tasks end up being similar to these generic tasks.

Even gpt4 gets tripped up when there's too many exact instructions needed to be executed on an input. That's why it's common that breaking a task into multiple steps and multiple improves performance.

It's wonderful to see improvements possible on smaller models.

yousif_123123 · on June 21, 2024

One downside for diffusion based systems (and I'm very noob in this) is that the model won't be able to see it's input and output in the same space, therefore wouldn't be able to do follow-up instructions to fix things or improve on it. Where as an LLM generating html could follow instructions to modify it as well. It's input and output are the same format.

HanClinto · on June 21, 2024

Oh? I would think that the input prompt to drive generation is not lost during generation iterations -- but I also don't know much about the architectural details.