More

georgewsinger · 2025-06-28T19:25:35 1751138735

Some explanations considered in this video:

1. Software engineering isn't "real" engineering.

2. Nobody cares about bugs when they write software in the first place.

3. Software is hard.

4. Software is early.

georgewsinger · 2025-05-13T22:36:07 1747175767

If you like reasoning about a program in terms of expression trees/graphs, I recently discovered that Wolfram Language has built-ins for this:

https://reference.wolfram.com/language/ref/ExpressionTree.ht...

georgewsinger · 2025-05-08T19:44:11 1746733451

This was such a great story.

Steve was a mischievous person himself, so surely a part of him respected this.

duxup · 2025-05-09T01:06:41 1746752801

My first real job my boss told me "Everyone fucks up, it's ok, when you do your first big fuck up... just be honest, and tell me."

3 years later I accidentally took down all the ATMs for one of the largest consumer banks in America for a while in the middle of the night.

My boss came in "Hey you finally did it, you took longer than most, but that was a good one!" and that was all that was ever said about it.

georgewsinger · 2025-05-06T15:02:53 1746543773

This is so cool. Real-time accent feedback is something language learners have never had throughout all of human history, until now.

Along similar lines, it would be useful to map a speaker's vowels in vowel-space (and likewise for consonants?) to compare native to non-native speakers.

I can't wait until something like this is available for Japanese.

yorwba · 2025-05-07T04:44:17 1746593057

The approach in the article is roughly equivalent to having someone listen to you speak and then repeating back in their own voice so you can attempt to copy their accent. Certainly nice to have available on demand without needing to coordinate schedules with another human.

A good accent coach would be able to do much better by identifying exactly how you're pronouncing things differently, telling you what you should be doing in your mouth to change that, and giving you targeted exercises to practice.

Presumably a model that predicts the position of various articulators at every timestamp in a recording could be useful for something similar.

pjc50 · 2025-05-06T15:23:20 1746545000

> something language learners have never had throughout all of human history

.. unless they had access to a native speaker and/or vocal coach? While an automated Henry Higgins is nifty, it's not something humans haven't been able to do themselves.

anadalakra · 2025-05-06T18:53:54 1746557634

Native speakers are less helpful at this than you might think. Speech coaches are absolutely the way to go, but they're outside the price range for most people ($200+/hr for a good coach). BoldVoice gives coach-level feedback and instruction at a price point that everyone can access, on demand.

yorwba · 2025-05-07T04:48:08 1746593288

Do you have another blog post showing your product giving targeted feedback about individual speech sounds? That's what I would expect from a coach.

anadalakra · 2025-05-07T18:22:40 1746642160

Not yet - this was our first technical blog post. You can check out the BoldVoice app and test out the sound-level feedback yourself. Or watch this app walkthrough video - https://www.youtube.com/watch?v=3Sv5K4Z9P4c

astrange · 2025-05-06T23:53:33 1746575613

You can take a language class rather than have a personal instructor. Although accents are a sensitive topic so I don't remember mine going into it much.

anadalakra · 2025-05-07T00:11:15 1746576675

As someone who took English classes for years growing up, I wish that were the case. In fact, most teachers don't really know how to teach pronunciation. Also, in a typical group class setting, it's challenging to give each student one-on-one feedback. On BoldVoice, we solve that with 1) unlimited instant feedback from sound-level AI - your most patient coach. 2) in-depth video lessons from the best coaches in the world (Hollywood accent coaches). I'm a cofounder of BoldVoice, by the way. :)

Muromec · 2025-05-07T00:15:43 1746576943

Language class and accent coaching are very different things.

astrange · 2025-05-07T06:41:12 1746600072

Try learning a language where they won't understand you with a foreign accent. I assume tonal languages are like this but haven't tried learning any.

Japanese is sort of like this - you have to say foreign words the Japanese way very forcibly, to the point that Americans will think you're being racist if they hear you do it.

ilyausorov · 2025-05-06T16:43:42 1746549822

That's a fascinating idea! Definitely something to try out for our team. We actively and continuously do all sorts of experiments with our machine learning models to be able to extract the most useful insights. We will definitely share if we find something useful here.

coherentpony · 2025-05-07T01:23:31 1746581011

> Real-time accent feedback is something language learners have never had throughout all of human history, until now.

Do you have a source for this? It doesn't seem plausible to me, but I'm not an expert.

georgewsinger · 2025-04-29T02:53:29 1745895209

What kinds of speed can be achieved, in terms of words per per minute?

tyleo · 2025-04-29T03:08:44 1745896124

I didn’t formally measure but if I had to guess I’d estimate I’m hitting ~40 words per minute on the last example.

georgewsinger · 2025-04-22T23:53:12 1745365992

Is this really how SOTA LLMs parse our queries? To what extent is this a simplified representation of what they really "see"?

helloplanets · 2025-04-23T05:39:22 1745386762

This is partly completely misleading and partly simplified, when it comes to SOTA LLMs.

Subject–Verb–Object triples, POS tagging and dependency structures are not used by LLMs. One of the fundamental differences between modern LLMs and traditional NLP is that heuristics like those are not defined.

And assuming that those specific heuristics are the ones which LLMs would converge on after training is incorrect.

jdspiral · 2025-04-23T02:04:21 1745373861

Yes, tokenization and embeddings are exactly how LLMs process input—they break text into tokens and map them to vectors. POS tags and SVOs aren't part of the model pipeline but help visualize structures the models learn implicitly.

georgewsinger · 2025-04-17T14:00:28 1744898428

Does anyone know if NOVAMIN based toothpastes (e.g. Sensodyne) have been tested?

sct202 · 2025-04-17T14:07:30 1744898850

The non-novamin Sensodyne was tested at 116ppb for lead and the tester listed the concerning ingredients: hydrated silica and titanium dioxide, which both are in the Sensodyne with Novamin tube I have from the UK.

SV_BubbleTime · 2025-04-17T14:11:38 1744899098

It says Sensodybe right in the article. However, typical Guardian, it is an article telling you what to think and not giving you all the information.

The question is, is there a safe level of lead, and are these tooth pastes under it?

more_corn · 2025-04-17T14:54:22 1744901662

Pretty sure the scientific consensus is there’s no safe level of lead exposure.

SV_BubbleTime · 2025-04-19T01:22:19 1745025739

That is incorrect.

The ideal amount is zero. But 1 part per 10 trillion is safe.

So… we have an estimated upper limit. Could we lower it without being able to detect a change in health effects? Likely.

pogue · 2025-04-17T14:34:45 1744900485

They only tested US toothpastes

pogue · 2025-04-17T14:34:10 1744900450

Novamin toothpaste is only sold & mfg in the UK. There are some conspiracy theories going around that the ingredient is so good they won't sell it to us in the US! [1]

I actually buy it off Amazon and use it myself because I have teeth sensitivity and it contains no SLS, which causes some irritation for me. It is quite interesting stuff. I doubt it would have lead since a synthetic compound. [2]

[1] https://medium.com/@ravenstine/the-curious-history-of-novami...

[2] https://en.wikipedia.org/wiki/Bioglass_45S5

bishopolis · 2025-04-20T08:34:43 1745138083

No, it's a trademark issue.

And, as Canada is a place, it has Novamin-based Sensodyne.

pogue · 2025-04-20T17:06:55 1745168815

Can you share some sources?

georgewsinger · 2025-04-17T13:57:18 1744898238

The author should make a meta-entry about how he makes the (insanely beautiful) diagrams in the book (ideally walking through the process).

psadauskas · 2025-04-17T14:04:52 1744898692

In the FAQ:

    07 How do you make the illustrations?
    By hand, in Figma. There's no secret - it's as complicated as it looks.

tonyhart7 · 2025-04-17T21:46:51 1744926411

using hand??? not after effect ?

god damn, that's some patient making animation right there

__loam · 2025-04-17T19:03:38 1744916618

Respect

behnamoh · 2025-04-17T21:22:07 1744924927

He has more content with figures on another platform: https://typefully.com/DanHollick

ivl · 2025-04-17T14:03:41 1744898621

The pair of animations on the page are beautifully done, not just technically but aesthetically as well. If the rest of the book is like that I'll be getting a copy.

georgewsinger · 2025-04-16T20:12:36 1744834356

Insane that people would downvote a totally reasonable comment offering a competing alternative. HN is supposed to be a community of tech builders.

throwaway314155 · 2025-04-16T20:19:50 1744834790

I would wager a sizeable chunk of the people here have no idea about the nature of this site's ownership/origin. This crowd finds this sort of thing to be a sort of astro-turfing - not communal.

edit: And I can't say I disagree.

groby_b · 2025-04-16T21:59:30 1744840770

It'a github link for an MIT-licensed project...

If the community considers that astroturfing, we have completely lost the plot what building is.

throwaway314155 · 2025-04-16T22:17:50 1744841870

The MIT license is basically the license of choice for growth hacking these days. Many VC backed companies follow this strategy - it serves to grow your userbase, a free-tier for developers using your ecosystem and last but not least, a chance for volunteers to do free work for you.

This is perhaps too cynical for this specific instance, but it's not overly cynical more broadly. Considering users of the site have to evaluate many of these offerings frequently, I don't blame them for having a negative gut reaction.

georgewsinger · 2025-04-16T17:20:19 1744824019

Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient Claude models have been for best-in-coding class.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

jjani · 2025-04-16T17:27:36 1744824456

Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but they don't have an SWE-bench score. Shows that looking at one such benchmark isn't very telling. Main advantage over Sonnet being that it's better at using a large amount of context, which is enormously helpful during coding tasks.

Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.

unsupp0rted · 2025-04-16T17:37:23 1744825043

Main advantage over Sonnet is Gemini 2.5 doesn't try to make a bunch of unrelated changes like it's rewriting my project from scratch.

itsmevictor · 2025-04-16T17:45:12 1744825512

I find Gemini 2.5 truly remarkable and overall better than Claude, which I was a big fan of

enraged_camel · 2025-04-16T18:14:07 1744827247

Still doesn't work well in Cursor unfortunately.

ai-christianson · 2025-04-16T19:45:33 1744832733

Works well in RA.Aid --in fact I'd recommend it as the default model in terms of overall cost and capability.

plantain · 2025-04-16T22:20:49 1744842049

Working fine here. What problems do you see?

michaelbarton · 2025-04-16T23:38:27 1744846707

Not the OP but believe they could be referring to the fact it’s not supported in edit mode yet, only agent mode.

So far for me that’s not been too much of a roadblock. Though I still find overall Gemini struggles with more obscure issues such as SQL errors in dbt

pdntspa · 2025-04-17T03:35:19 1744860919

Cline/Roo Code work fine with it

bitbuilder · 2025-04-16T18:34:10 1744828450

This was incredibly irritating at first, though over time I've learned to appreciate this "extra credit" work. It can be fun to see what Claude thinks I can do better, or should add in addition to whatever feature I just asked for. Especially when it comes to UI work, Claude actually has some pretty cool ideas.

If I'm using Claude through Copilot where it's "free" I'll let it do its thing and just roll back to the last commit if it gets too ambitious. If I really want it to stay on track I'll explicitly tell it in the prompt to focus only on what I've asked, and that seems to work.

And just today, I found myself leaving a comment like this: //Note to Claude: Do not refactor the below. It's ugly, but it's supposed to be that way.

Never thought I'd see the day I was leaving comments for my AI agent coworker.

TuxSH · 2025-04-16T18:58:02 1744829882

> If I'm using Claude through Copilot where it's "free"

Too bad Microsoft is widely limiting this -- have you seen their pricing changes?

I also feel like they nerfed their models, or reduced context window again.

Aeolun · 2025-04-16T23:51:14 1744847474

Claude is almost comically good outside of copilot. When using through copilot it’s like working with a lobotomized idiot (that complains it generated public code about half the time).

TuxSH · 2025-04-19T10:20:17 1745058017

It used to be good, or at least quite decent in GH Copilot, but it all turned into poop (the completions, the models, everything) ever since they announced the pricing changes.

Considering that M$ obviously trains over GitHub data, I'm a bit pissed, honestly, even if I get GH Copilot Pro for free.

erikw · 2025-04-16T19:12:53 1744830773

What language / framework are you using? I ask because in a Node / Typescript / React project I experience the opposite- Claude 3.7 usually solves my query on the first try, and seems to understand the project's context, ie the file structure, packages, coding guidelines, tests, etc, while Gemini 2.5 seems to install packages willy-nilly, duplicate existing tests, create duplicate components, etc.

unsupp0rted · 2025-04-17T11:04:22 1744887862

Node / Vue

jdgoesmarching · 2025-04-16T17:48:04 1744825684

Also that Gemini 2.5 still doesn’t support prompt caching, which is huge for tools like Cline.

scrlk · 2025-04-16T18:00:35 1744826435

2.5 Pro supports prompt caching now: https://cloud.google.com/vertex-ai/generative-ai/docs/models...

jdgoesmarching · 2025-04-16T18:07:53 1744826873

Oh, that must’ve been in the last few days. Weird that it’s only in 2.5 Pro preview but at least they’re headed in the right direction.

Now they just need a decent usage dashboard that doesn’t take a day to populate or require additional GCP monitoring services to break out the model usage.

Workaccount2 · 2025-04-16T17:52:48 1744825968

It's viable context, context length where is doesn't fall apart, is also much longer.

zaptrem · 2025-04-16T18:10:14 1744827014

I do find it likes to subtly reformat every single line thereby nuking my diff and making its changes unusable since I can’t verify them that way, which Sonnet doesn’t do.

armen52 · 2025-04-16T21:12:58 1744837978

I don't understand this assertion, but maybe I'm missing something?

Google included a SWE-bench score of 63.8% in their announcement for Gemini 2.5 Pro: https://blog.google/technology/google-deepmind/gemini-model-...

amedviediev · 2025-04-17T19:14:54 1744917294

I keep seeing this sentiment so often here and on X that I have to wonder if I'm somehow using a different Gemini 2.5 Pro. I've been trying to use it for a couple of weeks already and without exaggeration it has yet to solve a single programming task successfully. It is constantly wrong, constantly misunderstands my requests, ignores constraints, ignores existing coding conventions, breaks my code and then tells me to fix it myself.

spaceman_2020 · 2025-04-16T19:24:22 1744831462

I feel that Claude 3.7 is smarter, but does way too much and has poor prompt adherence

redox99 · 2025-04-17T02:26:22 1744856782

2.5 Pro is very buggy with cursor. It often stops before generating any code. It's likely a cursor problem, but I use 3.7 because of that.

saberience · 2025-04-17T10:08:41 1744884521

Eh, I wouldn't say that's accurate, I think it's situational. I code all day using AI tools and Sonnet 3.7 is still the king. Maybe it's language dependent or something, but all the engineers I know are full on Claude-Code at this point.

pizzathyme · 2025-04-16T19:10:52 1744830652

The image generation improvement with o4-mini is incredible. Testing it out today, this is a step change in editing specificity even from the ChatGPT 4o LLM image integration just a few weeks ago (which was already a step change). I'm able to ask for surgical edits, and they are done correctly.

There isn't a numerical benchmark for this that people seem to be tracking but this opens up production-ready image use cases. This was worth a new release.

mchusma · 2025-04-16T21:04:14 1744837454

Thanks for sharing that. that was more interesting then their demo. I tried it and it was pretty good! I have felt that the ability to iterate from images blocked this from any real production use I had. This may be good enough now.

Example of edits (not quite surgical but good): https://chatgpt.com/share/68001b02-9b4c-8012-a339-73525b8246...

ec109685 · 2025-04-17T01:24:00 1744853040

I don’t know if they let you share the actual images when sharing a chat. For me, they are blank.

ilaksh · 2025-04-16T21:23:33 1744838613

wait, o4-mini outputs images? What I thought I saw was the ability to do a tool call to zoom in on an image.

Are you sure that's not 4o?

AaronAPU · 2025-04-16T23:11:34 1744845094

I’m generating logo designs for merch via o4-mini-high and they are pretty good. Good text and comprehending my instructions.

ilaksh · 2025-04-17T04:39:51 1744864791

It's using the new gpt-4o, a version that's not in the API

ilaksh · 2025-04-17T04:38:06 1744864686

in the api or on the website?

Agentus · 2025-04-16T23:08:02 1744844882

also another addition: i previously tried to upload an image for chatgpt to edit and it was incapable under the previous model i tried. Now its able to change uploaded images using o4mini.

oofbaroomf · 2025-04-16T17:23:45 1744824225

Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)).[0] OpenAI said they got 69.1% in their blog post.

[0] swebench.com/#verified

georgewsinger · 2025-04-16T17:42:03 1744825323

Yes, however Claude advertised 70.3%[1] on SWE bench verified when using the following scaffolding:

> For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.

Arguably this shouldn't be counted though?

[1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

tedsanders · 2025-04-16T18:08:10 1744826890

I think you may have misread the footnote. That simpler setup results in the 62.3%/63.7% score. The 70.3% score results from a high-compute parallel setup with rejection sampling and ranking:

> For our “high compute” number we adopt additional complexity and parallel test-time compute as follows:

> We sample multiple parallel attempts with the scaffold above

> We discard patches that break the visible regression tests in the repository, similar to the rejection sampling approach adopted by Agentless; note no hidden test information is used.

> We then rank the remaining attempts with a scoring model similar to our results on GPQA and AIME described in our research post and choose the best one for the submission.

> This results in a score of 70.3% on the subset of n=489 verified tasks which work on our infrastructure. Without this scaffold, Claude 3.7 Sonnet achieves 63.7% on SWE-bench Verified using this same subset.

georgewsinger · 2025-04-16T19:46:44 1744832804

Somehow completely missed that, thanks!

I think reading this makes it even clearer that the 70.3% score should just be discarded from the benchmarks. "I got a 7%-8% higher SWE benchmark score by doing a bunch of extra work and sampling a ton of answers" is not something a typical user is going to have already set up when logging onto Claude and asking it a SWE style question.

Personally, it seems like an illegitimate way to juice the numbers to me (though Claude was transparent with what they did so it's all good, and it's not uninteresting to know you can boost your score by 8% with the right tooling).

ianbutler · 2025-04-17T01:39:24 1744853964

It isn't on the benchmark https://www.swebench.com/#verified

The one on the official leaderboard is the 63% score. Presumably because of all the extra work they had to do for the 70% score.

awestroke · 2025-04-16T17:41:10 1744825270

OpenAI have not shown themselves to be trustworthy, I'd take their claims with a few solar masses of salt

swyx · 2025-04-16T17:50:26 1744825826

they also gave more detail on their SWEBench scaffolding here https://www.latent.space/p/claude-sonnet

lattalayta · 2025-04-16T17:41:23 1744825283

I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks

mickael-kerjean · 2025-04-17T01:52:29 1744854749

The benchmark is something you can optimize for, doesn't mean it generalize well. Yesterday I tried for 2 hours to get claude to create a program that would extract data from a weird adobe file. 10$ later, the best I had is a program that was doing something like:

  switch(testFile) {
    case "test1.ase": // run this because it's a particular case 
    case "test2.ase": // run this because it's a particular case
    default:  // run something that's not working but that's ok because the previous case should
              // give the right output for all the test files ...
  }

emp17344 · 2025-04-16T17:56:48 1744826208

That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.

knes · 2025-04-17T04:42:49 1744864969

Right now the Swe-Bench leader Augment Agent still use Claude 3.7 in combo with o1. https://www.augmentcode.com/blog/1-open-source-agent-on-swe-...

The findings are open sourced on a repo too https://github.com/augmentcode/augment-swebench-agent

thefourthchime · 2025-04-16T18:06:02 1744826762

Also, if you're using Cursor AI, it seems to have much better integration with Claude where it can reflect on its own things and go off and run commands. I don't see it doing that with Gemini or the O1 models.

ksec · 2025-04-17T13:12:07 1744895527

I often wonder if we could expect that to reach 80% - 90% within next 5 years.