
Can you share the question? Or are you intentionally trying to keep it out of the training data pool?


Sadly no. Partly because I'd like to keep it untainted, but also because the tables involved come straight from my work, which is very much not OSS.

I can however try to paraphrase it so you get the gist of it.

The question asks for a SQL statement that updates rows in table A based on the related tables B and C, where table B is mentioned explicitly and C only implicitly, through the foreign keys provided in the context.

The key point that all previous models I've tested have missed is that the rows in A are many-to-one with B, and so the update should take this into account. This is implicit from the foreign key context and not mentioned directly in the question.

Think of distributing pizza slices among a group of friends. All previous models have completely missed this part and just given each friend a whole pizza.

GPT-OSS correctly identified this issue and flagged it in the response, but also included a sensible assumption of evenly dividing the pizza.
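
For illustration only (hypothetical table and column names, PostgreSQL-style UPDATE ... FROM, and the implicit table C left out for brevity), the shape of a correct answer is to split rather than copy:

    -- Hypothetical schema: many rows in A per row in B (many-to-one),
    -- linked by a.b_id -> b.id. Divide B's amount across its A rows
    -- (the even split GPT-OSS assumed) instead of copying it whole.
    -- Assumes numeric (non-integer) columns so the division doesn't truncate.
    UPDATE a
       SET amount = b.total_amount / cnt.n
      FROM b
      JOIN (SELECT b_id, COUNT(*) AS n
              FROM a
             GROUP BY b_id) AS cnt ON cnt.b_id = b.id
     WHERE a.b_id = b.id;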

I should note that some of the previous models also missed the implicit connection to table C, and thus completely failed to do anything sensible. But at least several of them figured this out. Of course I forgot to write that part down, so I can't say offhand which did what.

As for the code: for example, I've coded a Y combinator in Delphi, using intentionally terse, non-descriptive names, and asked the models to explain how the code works and what it does. Most models of ~7B and larger from the past year or so have managed to explain it fairly well. However, GPT-OSS was much more thorough and provided a much better explanation, showing a significantly better "understanding" of the code. It was also the first model smaller than Llama 3 70B that I've tried that correctly identified it as a Y combinator.
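
For anyone unfamiliar, the Y combinator is the classic fixed-point combinator from the lambda calculus:

    Y = λf. (λx. f (x x)) (λx. f (x x)),  so that  Y f = f (Y f)

That is, it provides recursion without any function referring to itself by name, which is what makes recognizing it under terse names a decent test of structural understanding rather than identifier matching.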


Here's a more concrete example where GPT-OSS 20B performed very well, IMHO. I tested it against Gemma 3 12B, Phi 4 Reasoning 14B, and Qwen 2.5-coder 14B.

The prompt is modeled as part of an agent of sorts, and the "human" question is intentionally ill-posed to emulate people saying the wrong thing.

The prompt begins by asking the model to convert a question into MATLAB code and add any assumptions as comments at the start of the code, or, if that's not possible, to output four hash marks followed by a reason why.

The (ill-posed) question is: "What's the cutoff frequency for an LC circuit with R equals 500 ohm and C equals 10 nanofarad?"

Gemma 3 took the bait, treated R as L, and proceeded to calculate the cutoff frequency of an LC circuit[1], completely ignoring the resulting mismatch of units. It added no comments at all. A completely wrong answer.

Qwen 2.5-coder detected the ill-posed nature, but decided to substitute a dummy value for L before calculating the LC-circuit answer. On the upside, it did add comments saying as much, so it was acceptable in that regard.

Phi 4 Reasoning reasoned for about 3 minutes before deciding to assume the question was about an RC circuit. It added this as a comment and correctly generated the code for an RC circuit. So a good answer, but slow.

GPT-OSS reasoned for 14 seconds and determined the question was ill-posed, thus outputting the hash marks followed by: "The cutoff frequency of an LC circuit cannot be determined with only R and C provided; the inductance L is required." A good answer, and fast.
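
For reference, this is roughly what the "good" RC-assumption answer (Phi 4's route) looks like in MATLAB; my own sketch, not any model's actual output:

    % Assumption: "LC circuit" is read as an RC low-pass filter,
    % since R and C were given but no L.
    R = 500;            % resistance in ohms
    C = 10e-9;          % capacitance in farads
    fc = 1/(2*pi*R*C)   % cutoff frequency, about 31.8 kHz

The LC route is a dead end without L, since the resonant frequency is 1/(2*pi*sqrt(L*C)).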

[1]: https://en.wikipedia.org/wiki/LC_circuit#Resonance_effect


Why Qwen2.5 and not Qwen3-30B-A3B-Thinking-2507 or Qwen3-Coder-30B-A3B-Instruct?


Mainly because I had it downloaded already, and I'm mostly interested in models that fit on my 16GB GPU. But since you asked, I ran the same questions through both 30B models in the q4_k_m variant, since GPT-OSS 20B is also quantized to about q4.

First the ill-posed question:

Qwen 3 Coder gave a very similar answer to Phi 4's, though with a more long-winded explanation in the comments. So not bad, but not great either.

Qwen 3 Thinking thought for a good minute before deciding the question was ill-posed and returning the hash marks. However, the explanation that followed was not as good as GPT-OSS's, IMHO: "The question is unclear because an LC circuit (without resistance) does not have a 'cutoff frequency'; cutoff frequency applies to filter circuits like RC or RLC. Additionally, the inductance (L) value is missing for calculating resonant frequency in an RLC circuit. The given R and C values are insufficient without L."

Sure, an unloaded LC filter doesn't have a cutoff frequency, but in all normal cases the load is implied[1] and so the LC filter does have a cutoff frequency. So more thinking to get to a worse answer.
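
(For reference, the corner frequency of a properly loaded LC low-pass is the usual f_c = 1/(2*pi*sqrt(L*C)), which is also why the missing L, rather than the missing load, is the real blocker here.)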

The SQL question:

Qwen 3 Coder did identify the same pitfall as GPT-OSS, but didn't flag it as clearly, mostly because it also flagged some unnecessary stuff, so the key point got drowned out. It made the same assumption about dividing evenly, and overall the answer was about as good. However, on my computer it ran at roughly half the tokens per second of GPT-OSS, at just ~9 tokens/second.

Qwen 3 Thinking thought for 3 minutes, yet managed to miss the key aspect, thus giving everyone a whole pizza. And it did so at the same slow pace as Qwen 3 Coder.

The SQL question requires a somewhat large context due to the large table definitions, and since the 30B models are larger, more layers had to be pushed to the CPU, which I assume is the major factor in the speed drop.

So overall, Qwen 3 Coder was a solid contender, but on my PC much slower. If it could run entirely on the GPU I'd certainly try it a lot more. Interestingly, Qwen 3 Thinking was just plain worse. Perhaps it's not tuned for tasks beyond coding?

[1]: https://www.ti.com/lit/an/slaa701a/slaa701a.pdf section 3.3 page 9

[2]: https://github.com/ollama/ollama/issues/11772


Thank you for testing; I will test GPT-OSS for my use case as well. If you're interested: I have 8 GB VRAM and 32 GB RAM and get around 21 tokens/s with tensor offloading, so I would assume your setup should be even faster than mine with these optimizations. I use the IQ4_KSS quant (by ubergarm on HF) with ik_llama.cpp and this command:

$env:LLAMA_SET_ROWS = "1"; ./llama-server -c 140000 -m D:\ik_llama.cpp\build\bin\Release\models\Qwen3-Coder-30B-A3B-Instruct-IQ4_KSS.gguf -ngl 999 --flash-attn -ctk q8_0 -ctv q8_0 -ot "blk\.(19|2[0-9]|3[0-9]|4[0-7])\.ffn_.*_exps\.=CPU" --temp 0.7 --top-p 0.8 --top-k 20 --repeat_penalty 1.05 --threads 8

In my case I offload layers 19-47; maybe you would just have to offload 37-47, i.e. "blk\.(3[7-9]|4[0-7])\.ffn_.*_exps\.=CPU"


Yeah, I think I could get better performance out of both by tweaking, but so far ease of use has won out.



