
I'm disappointed this isn't another T5Gemma model designed for translation. The big use case I see for this is fine-tuning. What are people using this for?

Comparison to GPT-OSS-20B (irrespective of how you feel about that model's actual performance) doesn't fill me with confidence. Given GLM 4.7 seems like it could be competitive with Sonnet 4/4.5, I would have hoped that their flash model would run circles around GPT-OSS-120B. I do wish they would provide an Aider result for comparison. Aider may be saturated among SotA models, but it's not at this size.

Hoping a 30-A3B runs circles around a 117-A5.1B is a bit of wishful thinking, especially when you're testing embedded knowledge. From the numbers, I think this model excels at agent calls compared to GPT-20B; the rest are about the same in terms of performance.

The benchmarks lie. I've been using GLM 4.7 and it's pretty okay with simple tasks, but it's nowhere near Sonnet. Still useful and good value, but it's not even close.

> Given GLM 4.7 seems like it could be competitive with Sonnet 4/4.5

Not for code. The quality is so low it's roughly on par with Sonnet 3.5.


Can you compare this to Unsloth?


Posts like this really mean "this doesn't work like I expect it to based on my background with some other technology".

But in this case, I tried in earnest to use nextjs for a project with auth & stripe, etc. this past week, and I can't believe how frustrating it is to get stupid things like modal dialogs to work properly in the client.

I have tons of experience with React SPAs. But the client/server divide in Next remains quite inscrutable to me to the extent that I'm just going to start again with Django (where I nearly started it in the first place).

So yes, it doesn't work like I expect it to either...


Huh? Modals work the same way they do anywhere else.


My day job involves training language models (mostly seq2seq) for low-resource languages (with substantially less than 2GB of data).

A few thoughts:

1. You can't cut off the embedding layer or discard the tokenizer without throwing out the model you're starting with. The attention matrices are applied to, and were trained together with, the token embeddings.

2. Basically the same goes for the tokenizer. If you need to add some tokens, that can be done (or you can repurpose existing tokens) if your script is unique (a problem I face periodically). But if you're initializing weights for new tokens, those tokens start out untrained. So if you do that for all your data, you're effectively training a new model. (See the first sketch after this list.)

3. The Gemma model series sounds like a good fit for your use case. I'm not confident about Hebrew support, let alone Hasidic Yiddish, but it is relatively multilingual (more so than many other open models). Being multilingual means the odds are greater that it has tokens relevant to your corpus that have already been trained towards a useful point for your dataset.

4. If you can generate synthetic data with synonyms or POS tags, then great. But this is a language model, so you need to think about how you can usefully teach it natural sequences of text (not how to tag nouns or identify synonyms - I also did a bunch of classic NLP, and it's depressing how irrelevant all that work is these days). I suspect that repurposing this data won't be worth it. So, if anything, I'd recommend doing that as a second pass.

5. Take a look at the Unsloth notebooks for training a Gemma 3 model and load up your data (second sketch below). I reckon it'll surprise you how effective these models are...
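To make point 2 concrete, here's a minimal sketch with Hugging Face transformers; the model name and tokens are placeholders, not recommendations. Any tokens you add get randomly initialized embedding rows, which is exactly the untrained-weights caveat above:

```
# Minimal sketch (Hugging Face transformers); model name and tokens
# are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-3-4b-it"  # hypothetical choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add only the handful of tokens your script actually needs
# (or repurpose existing ones instead).
num_added = tokenizer.add_tokens(["<my_char_1>", "<my_char_2>"])  # placeholders

# Grow the embedding matrix to match. The new rows are randomly
# initialized, i.e. untrained until you fine-tune.
model.resize_token_embeddings(len(tokenizer))
```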
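And for point 5, roughly the shape of setup those notebooks walk through; the model name and LoRA hyperparameters here are just illustrative defaults, not prescriptions:

```
# Rough sketch of an Unsloth LoRA setup; names and hyperparameters
# are illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",  # pick a size that fits your GPU
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA-style loading for modest hardware
)

# Train only small LoRA adapters -- a sensible regime for a ~2GB corpus.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```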


I would _love_ to see more DIY mouse options. I feel like the mechanical keyboard crowd has so many options.

I've been dreaming of a set of lego-style mouse parts that can be assembled however you like... want another button? here you go. Want it on the side? Modify the 3D print file. Want bluetooth? Use this board... Want USB-C? Use that board... Want both? We've got you covered... Want a hyper-scroll wheel? Well, Logitech has a patent on that one, but here's the closest thing you can get on a DIY mouse. Now click these buttons in the configurator, hit "upload", and the firmware is installed to use your new mouse on any machine.


On the subject of adding more buttons, I think there needs to be a rethinking of mouse button events at the OS level. Gaming mice with 12-20+ buttons have to resort to creating keyboard events with weird key combinations because there aren't actually that many mouse events, which is insane. There are currently only 12 valid integers (12 types of "click") in the raw mouse events. Those need special handling because the numbers are chosen very strangely, but why can't we agree that, for any number within some range, odd numbers are button-presses and even numbers are button-releases, or something like that? You don't have to create named events for all of them, but the raw integers should be valid even if you have to use the lower-level events.

If I want to build a mouse with 32,000 buttons, the limit should not be the operating system's mouse event.
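To illustrate, here's a toy sketch of the kind of encoding I mean; this is a hypothetical scheme, not any existing OS API:

```
# Toy sketch of the proposed encoding -- not an existing OS API.
# Any integer in an agreed range encodes (button, press/release):
# odd codes are presses, even codes are releases.

def encode(button: int, pressed: bool) -> int:
    """Map a button index and press/release state to one event code."""
    return button * 2 + (1 if pressed else 0)

def decode(code: int) -> tuple[int, bool]:
    """Recover (button, pressed) from a raw event code."""
    return code // 2, bool(code % 2)

# A mouse with 32,000 buttons needs only 64,000 codes -- no named
# events, no keyboard-combination workarounds.
assert decode(encode(31999, True)) == (31999, True)
assert decode(encode(31999, False)) == (31999, False)
```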


There's a YouTube channel called Optimum where he built his perfect mouse and brought it to product stage. It may give you some ideas (the sensor PCB, for example, is a kit you can buy). https://youtu.be/oMUEsz71_xQ


Totally agree. Mouse preference is just as personal as, and maybe even more subjective than, keyboard preference, and even with the plethora of commercially available models, someone is going to be left settling because their needs aren't quite met.


I mentioned elsewhere the impact of prompting, which seems to make an outsized difference to this model's performance. I tried NER and POS tagging (with somewhat disappointing results).

One thing that worked strikingly well was translation of non-Indo-European languages: I had success with Thai and Bahasa Indonesia -> English, for example...


So I had a similar experience with your prompt (on the f16 model). But I do think that, at this size, prompting differences make a bigger impact. I had this experience trying to get it to list entities. It kept trying to give me a bulleted list and I was trying to coerce it into some sort of structured output. When I finally just said "give me a bulleted list and nothing else" the success rate went from around 0-0.1 to 0.8+.

In this case, I changed the prompt to:

---

Tallest mountains (in order):

```
- Mount Everest
- Mount K2
- Mount Sahel
- Mount Fuji
- Mount McKinley
```

What is the second tallest mountain?

---

Suddenly, it got the answer right 95+% of the time.


Still pretty sad that it's only 95% instead of 99%


This is great! I feel like there's been a resurgence of interest in language design and compilers of late. I have no business having an interest in this kind of thing, but even I have been inspired to try and make the changes to javascript that I think would improve it: https://chicory-lang.github.io/


82.2 on Aider

Still actually falling behind the official scores for o3 high. https://aider.chat/docs/leaderboards/


Does 82.2 correspond to the "Percent correct" of the other models?

Not sure if OpenAI has updated o3, but it looks like "pure" o3 (high) has a score of 79.6% in the linked table, while the "o3 (high) + gpt-4.1" combo has the highest score at 82.7%.

The previous Gemini 2.5 Pro Preview 05-06 (yes, not the current 06-05!) was at 76.9%.

That looks like a pretty nice bump!

But either way, these Aider benchmarks seem to be the most useful/trustworthy benchmarks currently, and really the only ones I'm paying attention to.


But so.much.cheaper.and.faster. Pretty amazing.


That's the older 05-06 preview, not the new one from today.


They knew that. The 82.2 comes from the new benchmarks in the OP, not from the Aider URL; the URL was supplied for comparison.


Ah, thanks for clearing that up!

