
Or that he does understand, but avoids calling it a tax because Americans in general hate that? It's wordsmithing politics 101.

Guy is a billionaire twice president of the US. He's many things (imho really bad things), but not ignorant.


He wasn't avoiding "calling it a tax"; he said calling it that was fake news and that everyone but the US would be paying for it.

Suggesting he was just tiptoeing around the language completely whitewashes his obvious, widespread lies.


>Guy is a billionaire twice president of the US. He's many things (imho really bad things), but not ignorant.

I'm no Trump fan. Never voted for him. Really, really disagree with his politics. But here are some things I think are observably true:

- He understands media

- He understands very well how to manipulate public sentiment and opinion

- He understands the power of celebrity

- He understands the importance of tailoring to audiences

- He's willing to say and do things others aren't. There's no line in the sand for him

You can be otherwise ignorant of a great many things (including tariffs, of which he has demonstrated a lack of understanding going back decades) and still get elected based on these traits.


A few of you will die of hunger, but these are taxes I am not willing to pay.

Not sure if I would trade off speed for accuracy.

Yes, it's incredibly boring to wait for the AI agents in IDEs to finish their job. I get distracted and open YouTube. Once I gave Cline a prompt so big and complex that it spent 2 straight hours writing code.

But after those 2 hours I spent 16 more tweaking and fixing all the stuff that wasn't working. I now realize I should have done things incrementally, even though I had a pretty good idea of the final picture.

I've been more and more using only the "thinking" models: o3 in ChatGPT, and Gemini / Claude in IDEs. They're slower, but usually get it right.

But at the same time I am open to the idea that speed can unlock new ways of using the tooling. It would still be awesome to basically just have a conversation with my IDE while I am manually testing the app. Or combine really fast models like this one with a "thinking" background one that runs for seconds/minutes but tries to catch the bugs left behind.
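
Something like this rough sketch is what I imagine; the model handles and the review heuristic are made up, just to show the shape of it:

    import threading

    def respond_with_background_check(prompt, fast_model, thinking_model, on_issue):
        """Return the fast model's draft immediately; meanwhile a slower
        'thinking' model reviews it in the background and flags problems."""
        draft = fast_model(prompt)

        def audit():
            review = thinking_model(f"Review this change for bugs:\n{draft}")
            if "looks good" not in review.lower():
                on_issue(review)  # e.g. surface a warning in the IDE

        threading.Thread(target=audit, daemon=True).start()
        return draft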

I guess only giving it a try will tell.


So my personal belief is that diffusion models will enable higher degrees of accuracy. This is because unlike an auto-regressive model it can adjust a whole block of tokens when it encounters some kind of disjunction.

Think of the old example where an auto-regressive model would output "There are 2 possibilities..." before it has really enumerated them. Often the model has trouble overcoming the bias and will hallucinate a response to fit the preceding tokens.

Chain of thought and other approaches help overcome this and other issues by incentivizing validation, etc.

With diffusion, however, it is easier for the rest of the generated answer to change that set of tokens to match the actual number of possibilities enumerated.

This is why I think you'll see diffusion models be able to do some more advanced problem solving with a smaller number of "thinking" tokens.


Unfortunately, the intuition and the math proofs so far suggest that autoregressive training learns the joint distribution of probabilistic streams of tokens much better than diffusion models do or ever will. My intuitive take is that the conditional probability distribution of decoder-only autoregressive models is at just the right level of complexity for probabilistic models to learn accurately enough.

Intuitively (and simplifying things at the risk of breaking rigor), diffusion (or masked) models have to occasionally issue tokens with less information, and thus higher variance, than a pure autoregressive model would, so the joint distribution, i.e. the probability of the whole sentence/answer, will be lower, and thus diffusion models will never get precise enough. Of course, during generation the sampling techniques influence the above simplified idea dramatically, and the typical randomized sampling for next-token prediction is suboptimal and could in principle be beaten by a carefully designed block diffusion sampler in some contexts, though I haven't seen real examples of it yet.

But the key idea of the scribbles above still holds: autoregressive models will always be better (or at least equal) probabilistic models of sequential data than diffusion models will be. So diffusion models mostly offer a tradeoff of performance vs quality. Sometimes there is a lot of room for that tradeoff in practice.
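
To make the hand-waving slightly more concrete, this is the comparison I have in mind, in rough notation (a sketch of the intuition, not a proof):

    % Autoregressive chain rule: every conditional sees the full prefix
    p_\theta(x_{1:N}) = \prod_{t=1}^{N} p_\theta(x_t \mid x_{<t})

    % Masked/diffusion training instead fits conditionals of the form
    %   p_\theta(x_t \mid x_S),  S \subseteq \{1,\dots,N\} \setminus \{t\}
    % where S is a random visible subset. When S is sparse, these
    % conditionals typically carry less information (higher entropy)
    % than p_\theta(x_t \mid x_{<t}), which is roughly the sense in which
    % some tokens are emitted "with less information and higher variance".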


This is tremendously interesting!

Could you point me to some literature? Especially regarding mathematical proofs of your intuition?

I’d like to recalibrate my priors to align better with current research results.


From the mathematical point of view the literature is about the distinction between a "filtering" distribution and a "smoothing" distribution. The smoothing distribution is strictly more powerful.

In theory, intuitively, the smoothing distribution has access to all the information that the filtering distribution has, plus some additional information, and therefore has a minimum (achievable loss) lower than the filtering distribution's.

In practice, because the smoothing input space is much bigger, keeping the same number of parameters we may not reach a better score: with diffusion we are tackling a much harder problem (the whole problem), whereas with autoregressive models we are taking a shortcut, one that humans are probably biased toward as well (communication evolved so that it can be serialized to be exchanged orally).
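
For concreteness, the two objects being compared, in my notation:

    % Filtering: predict token k from the past only
    p(x_k \mid x_{1:k-1})

    % Smoothing: predict token k from past and future
    p(x_k \mid x_{1:k-1}, x_{k+1:N})

    % Conditioning on extra information can only reduce (or keep equal)
    % the uncertainty:
    H(x_k \mid x_{1:k-1}, x_{k+1:N}) \le H(x_k \mid x_{1:k-1})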


Although what you say about smoothing vs filtering is true in principle, for conditional generation of the eventual joint distribution, starting from the same condition and using an autoregressive vs a diffusive LLM, it is the smoothing distribution that has less power. In other words, during inference, starting from J given tokens, writing token number K is of course better with diffusion if you also have some given tokens after token K and up to the maximal token N. However, if your input is fixed (tokens up to J) and you have to predict those additional tokens (from J+1 to N), you are solving a harder problem and end up with a lower joint probability at the end of the inference for the full generated sequence from J+1 up to N.


I am still jetlagged and not sure what the most helpful reference would be. Maybe start from the block diffusion paper I recommended in a parallel thread and trace your way up/down from there. The logic leading to Eq 6 is a special case of such a math proof.

https://openreview.net/forum?id=tyEyYT267x


What are the barriers to mixed architecture models? Models which could seamlessly pass from autoregressive to diffusion, etc.

Humans can integrate multiple sensory processing centers and multiple modes of thought all at once. It's baked into our training process (life).


The human processing is still autoregressive, but using multiple parallel synchronized streams. There is no problem with such an approach and my best guess is that in the next year we will see many teams training models using such tricks for generating reasoning traces in parallel.

The main concern is taking a single probabilistic stream (eg a book) and comparing autoregressive modelling of it with a diffusive modelling of it.

Regarding mixing diffusion and autoregressive—I was at ICLR last week and this work is probably relevant: https://openreview.net/forum?id=tyEyYT267x


Maybe diffusion for "thoughts" and autoregressive for output :S


Suggests an opportunity for hybrids, where the diffusion model might be responsible for large scale structure of response and the next token model for filling in details. Sort of like a multi scale model in dynamics simulations.


> it can adjust a whole block of tokens when it encounters some kind of disjunction.

This is true in principle for general diffusion models, but I don't think it's true for the noise model they use in Mercury (at least, going by a couple of academic papers authored by the Inception co-founders.) Their model generates noise by masking a token, and once it's masked, it stays masked. So the reverse-diffusion gets to decide on the contents of a masked token once, and after that it's fixed.
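
To illustrate, here's a toy sketch of an absorbing ("mask once, stay masked") forward process; the mask id and the schedule are invented for the example, not taken from their papers:

    import random

    MASK = -1  # hypothetical id for the absorbing [MASK] token

    def forward_step(tokens, p):
        """One noising step: each still-visible token is masked with
        probability p; already-masked positions stay masked forever."""
        return [MASK if tok != MASK and random.random() < p else tok
                for tok in tokens]

    def corrupt(tokens, steps=10):
        """Run the absorbing forward process; by the final step every
        position is masked, so the reverse (denoising) pass has to commit
        to each position's content exactly once."""
        xs = [list(tokens)]
        for t in range(1, steps + 1):
            xs.append(forward_step(xs[-1], p=t / steps))
        return xs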


Here are two papers linked from Inception's site:

1. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution - https://arxiv.org/abs/2310.16834

2. Simple and Effective Masked Diffusion Language Models - https://arxiv.org/abs/2406.07524


Thanks, yes, I was thinking specifically of "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution". They actually consider two noise distributions: one with uniform sampling for each noised token position, and one with a terminal masking (the Q^{uniform} and Q^{absorb}.) However, the terminal-masking system is clearly superior in their benchmarks.

https://arxiv.org/pdf/2310.16834#page=6


The exact types of path dependencies in inference on text-diffusion models look like an interesting research project.


Yes, the problem is coming up with a noise model where reverse diffusion is tractable.


Thank you, I'll have to read the papers. I don't think I have read theirs.


Once that auto-regressive model goes deep enough (or uses "reasoning"), it actually has modeled what possibilities exist by the time it's said "There are 2 possibilities.."

We're long past that point of model complexity.


But as everyone knows, computer science has two hard problems: naming things, cache invalidation, and off by one errors.


Check out RooCode if you haven't. There's an orchestrator mode that can start with a big model to come up with a plan and break it down, then spin out small tasks to smaller models for scoped implementation.


If you’re open to a terminal-based approach, this is exactly what my project Plandex[1] focuses on—breaking up and completing large tasks step by step.

1 - https://github.com/plandex-ai/plandex


Wouldn't it be possible to trade speed back for accuracy, e.g. by asking the model to look at a problem from different angles, let it criticize its own output, etc.?


Just have it sample for longer, or create a simple workflow that uses a Monte Carlo tree search approach. I don't see why this won't improve accuracy. I would love to see someone run tests to see how accurate the model is compared to similar-parameter models in a per-time-block benchmark. If it can get the same accuracy as a similar-parameter autoregressive model even at half the speed, you already have a winner, besides the other advantages of a diffusion-based model.
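
As a minimal sketch of the "sample for longer" idea (generate and score stand in for the fast model and some verifier, both hypothetical; a real MCTS variant would branch on partial generations instead):

    def best_of_n(prompt, generate, score, n=16):
        """Trade some of the model's speed back for accuracy: draw n
        candidates from the fast model and keep the best-scoring one."""
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=score)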


The AI field desperately needs smarter models, not faster models.


It definitely needs faster and cheaper models. Fast and cheap models could replace software in tons of situations. Imagine a vending machine or a mobile game or a word processor where basically all the logic is implemented as a prompt to an LLM. It would serve as the ultimate high-level programming language.
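
To illustrate what I mean by "all logic is a prompt", here's a toy sketch; the complete() call stands in for whatever fast, cheap model you'd use, and the prompt format is made up:

    import json

    VENDING_PROMPT = """You control a vending machine.
    Inventory with prices (cents): {inventory}
    Credit inserted: {credit} cents.
    Customer said: "{utterance}"
    Reply with exactly one JSON object: {{"dispense": <item or null>, "refund": <cents>}}.
    Never dispense an item that costs more than the credit."""

    def handle(utterance, inventory, credit, complete):
        """All the 'business logic' lives in the prompt; the code only
        formats state in and parses the model's JSON decision out."""
        reply = complete(VENDING_PROMPT.format(
            inventory=inventory, credit=credit, utterance=utterance))
        return json.loads(reply)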


I think natural language to code is the right abstraction. Easy enough barrier to entry but still debuggable. Debugging why an LLM randomly gives you Mountain Dew instead of Sprite if you have a southern accent sounds like a nightmare.


I'm not sure it would be that hard to debug. Make sure you can reproduce the llm state (by storing the random seed for the session, or something like that) and then ask it "why did you just now give that customer mountain dew when they ordered sprite?"


> and then ask it "why did you just now give that customer mountain dew when they ordered sprite?"

Worse than useless for debugging.

An LLM can't think and doesn't have capabilities for self-reflection.

It will just generate a plausible stream of tokens in reply that may or may not correspond to the real reason why.


Of course an LLM can't think. But that doesn't mean it can't answer simple questions about the output that was produced. Just try it out with ChatGPT when you have time. Even if it's not perfectly accurate, it's still useful for debugging.

Just think about it as a human employee. Can they always say why they did what they did? Often, but not always. Sometimes you will have to work to figure out the misunderstanding.


> it's still useful for debugging

How so? What the LLM says is whatever is more likely given the context. It has no relation to the underlying reality whatsoever.


Not sure what you mean by "relation to the underlying reality". The explanation is likely to be correlated with the underlying reason for the answer.

For example, here is a simple query:

> I put my bike in the bike stand, I removed my bike from the bike stand and biked to the beach, then I biked home, where is my bike. Answer only with the answer and no explanation

> Chatgpt: At home

> Why did you answer that?

> I answered that your bike is at home because the last action you described was biking home, which implies you took the bike with you and ended your journey there. Therefore, the bike would logically be at home now.

Do you doubt that the answer would change if I changed the query to make the final destination be "the park" instead of "home"? If you don't doubt that, what do you mean that the answer doesn't correspond to the underlying reality? The reality is the answer depends on the final destination mentioned, and that's also the explanation given by the LLM, clearly the reality and the answers are related.


You need to find an example of the LLM making a mistake. In your example, ChatGPT answered correctly. There are many examples online of LLMs answering basic questions incorrectly, and then the person asking the LLM why it did so. The LLM response is usually nonsense.

Then there is the question of what you would do with its response. It’s not like code where you can go in and update the logic. There are billions of floating point numbers. If you actually wanted to update the weights you’ll quickly find yourself fine-tuning the monstrosity. Orders of magnitude more work than updating an “if” statement.


I don't think LLMs can always give correct explanations for their answers. That's a misunderstanding.

> Then there is the question of what you would do with its response.

Sure but that's a separate question. I'd say the first course of action would be to edit the prompt. If you have to resort to fine tuning I'd say the approach has failed and the tool was insufficient for the task.


It’s not really a separate question imo. We want to know whether computer code or prompts are better for programming things like vending machines.

For LLMs, interpretability is one problem. The ability to effectively apply fixes is another. If we are talking about business logic, have the LLM write code for it and don’t tie yourself in knots begging the LLM to do things correctly.

There is a grey area though, which is where code sucks and statistical models shine. If your task was to differentiate between a cat and a dog visually, good luck writing code for that. But neural nets do that for breakfast. It’s all about using the right tool for the job.


> The explanation is likely to be correlated with the underlying reason for the answer.

No it isn't. You misunderstand how LLMs work. They're giant Mad Libs machines: given these surrounding words, fill in this blank with whatever statistically is most likely. LLMs don't model reality in any way.


Did you read the example above? Do you disagree that the LLM provided a correct explanation for the reason it answered as it did?

> They're giant Mad Libs machines: given these surrounding words, fill in this blank with whatever statistically is most likely. LLMs don't model reality in any way.

Not sure why you think this is incompatible with the statement you disagreed with.


> Do you disagree that the LLM provided a correct explanation for the reason it answered as it did?

Yes, I do. An LLM replies with the most likely string of tokens. Which may or may not correspond with the correct or reasonable string of tokens, depending on how stars align. In this case the statistically most likely explanation the LLM replied with just happened to correspond with the correct one.


> In this case the statistically most likely explanation the LLM replied with just happened to correspond with the correct one.

I claim that case is not as uncommon as people in this thread seem to think.


Why not just store the state in the code and debug as usual, perhaps with LLM assistance? At least that’s tractable.


Why on earth would you implement a vending machine using an LLM?


The same reason we make the butter dish suffer from existential angst.


Because it's easy and cheap. Like how many products use a Raspberry Pi or ESP32 when an ATtiny would do.


How in the world is this easy and cheap? Are you planning to run this LLM inside the vending machine? Or are you planning to send those prompts to a remote LLM somewhere?


The premise here is that the model runs fast and cheap. With the current state of the technology running a vending machine using an LLM is of course absurd. The point is that accuracy is not the only dimension that brings qualitative change to the kind of applications that LLMs are useful for.


Running a vending machine using an LLM is absurd not because we can't run LLMs fast or cheap enough; it's because LLMs are not reliable, and we don't yet know how to make them more reliable. Our best LLM, o3, doubled the previous model's (o1's) hallucination rate. OpenAI says it hallucinated a wrong answer 33% of the time in benchmarks. Do you want a vending machine that screws up 33% of the time?

Today, the accuracy of LLMs is by far a bigger concern (and a harder problem to solve) than its speed. If someone releases a model which is 10x slower than o3, but is 20% better in terms of accuracy, reliability, or some other metric of its output quality, I'd switch to it in a heartbeat (and I'd be ready to pay more for it). I can't wait until o3-pro is released.


Do you seriously think a typical contemporary LLM would screw up 33% of vending machine orders?

I don't know what benchmark you're looking at but I'm sure the questions in it were more complicated than the logic inside a vending machine.

Why don't you just try it out? It's easy to simulate, just tell the bot about the task and explain to it what actions to perform in different situations, then provide some user input and see if it works or not.


You could run a 3B model on 200 dollars worth of hardware and it would do just fine, 100 percent of the time, most of the time. I could definitely see someone talking it out of a free coke now and then though.

With vending machines costing 2-5k, it’s not out of the question, but it’s hard to imagine the business case for it. Maybe the tantalizing possibility of getting a free soda would attract traffic and result in additional sales from frustrated grifters? Idk.


Yet DeepSeek has shown that more dialogue increases quality. Increasing speed is therefore important if you need thinking models.


If you have much more speed in the available time, for an activity like coding, you could use that for iteration, writing more tests and satisfying them, especially if you can pair that with a concurrent test runner to provide feedback. I'm not sure the end result would be lower scoring/smartness than an LLM could achieve in the same duration.
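
Roughly the loop I have in mind, with generate_patch, apply_patch, and run_tests standing in for the fast model and a concurrent test runner (all hypothetical names):

    def iterate_until_green(spec, generate_patch, apply_patch, run_tests, max_rounds=20):
        """Spend the extra speed on iteration: generate a patch, run the
        tests, feed the failures back, and repeat until the suite passes."""
        feedback = ""
        for _ in range(max_rounds):
            patch = generate_patch(spec, feedback)
            apply_patch(patch)
            passed, failures = run_tests()
            if passed:
                return patch
            feedback = failures  # the failing output becomes the next prompt
        return None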


> I'm not sure the end result would be lower scoring/smartness than an LLM could achieve in the same duration.

It probably wouldn’t with current models. That’s exactly why I said we need smarter models - not more speed. Unless you want to “use that for iteration, writing more tests and satisfying them, especially if you can pair that with a concurrent test runner to provide feedback.” - I personally don’t.


LLMs can't think, so "smarter" is not possible.


They can by the normal English definitions of "think" and "smart". You're just redefining those words to exclude AI because you feel threatened by it. It's tedious.


Incorrect. LLMs have no self-reflection capability. That's a key prerequisite for "thinking". ("I think, therefore I am.")

They are simple calculators that answer with whatever tokens are most likely given the context. If you want reasonable or correct answers (rather than the most likely) then you're out of luck.


It is not a key prerequisite for "thinking". It's "I think therefore I am" not "I am self-aware therefore I think".

In the 90s if your cursor turned into an hourglass and someone said "it's thinking" would you have pedantically said "NO! It is merely calculating!"

Maybe you would... but normal people with normal English would not.


Self-reflection is not the same thing as self-awareness.

Computers have self-reflection to a degree - e.g., they react to malfunctions and can evaluate their own behavior. LLMs can't do this, in this respect they are even less of a thinking machine than plain old dumb software.


Technically correct and completely beside the point.


People can't fly.


I think speed and convenience are essential. I use the ChatGPT desktop app for coding. Not because it's the best, but because it's fast and easy and doesn't interrupt my flow too much. I mostly stick to the 4o model. I only use the o3 model when I really have to, because at that point getting an answer is slooooow. 4o is more than good enough most of the time.

And more importantly it's a simple option+shift+1 away. I simply type something like "fix that" and it has all the context it needs to do its thing. Because it connects to my IDE and sees my open editor and the highlighted line of code that is bothering me. If I don't like the answer, I might escalate to o3 sometimes. Other models might be better but they don't have the same UX. Claude desktop is pretty terrible, for example. I'm sure the model is great. But if I have to spoon feed it everything it's going to annoy me.

What I'd love is for smaller, faster models to be used by default and for them to escalate to slower, more capable models on a need-to-have basis only. Using something like o3 by default makes no sense. I don't want to have to think about which model is optimal for which question. The problem of figuring out which model is best to use is a much simpler one than answering my questions. And automating that decision opens the door to having a multitude of specialized models.
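
The routing decision itself is far simpler than the questions being routed, so even something as crude as this sketch would cover a lot of cases (the classify helper and the model handles are hypothetical):

    def answer(question, fast_model, slow_model, classify):
        """Use the small, fast model by default and escalate to the slower,
        more capable one only when the router thinks it's needed."""
        if classify(question) == "hard":
            return slow_model(question)
        draft = fast_model(question)
        if classify(f"Q: {question}\nA: {draft}") == "low_confidence":
            return slow_model(question)  # escalate when the draft looks shaky
        return draft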


You're missing that Claude desktop has MCP servers, which can extend it to do a lot more, including much better real life "out of the box" uses. You can do things like use Obsidian as a filesystem or connect to local databases to really extend the abilities. You can also read and write to github directly and bring in all sorts of other tools.


> Not sure if I would trade off speed for accuracy.

Are you, though?

There are obvious examples of obtaining speed without losing accuracy, like using a faster processor with bigger caches, or more processors.

Or optimizing something without changing semantics, or the safety profile.

Slow can be unreliable; a 10 gigabit ethernet can be more reliable than a 110 baud acoustically-coupled modem in mean time between accidental bit flips.

Here, the technique is different, so it is apples to oranges.

Could you tune the LLM paradigm so that it gets the same speed, and how accurate would it be?


> I now realize I should have done things incrementally even when I have a pretty good idea of the final picture.

Or just save yourself the time and money and code it yourself like it's 2020.

(Unless it's your employer paying for this waste, in which case go for it, I guess.)


You left an LLM to code for two hours and then were surprised when you had to spend a significant amount of time more cleaning up after it?

Is this really what people are doing these days?


Accuracy is a myth.

These models do not reason. They do not calculate. They perform no objectivity whatsoever.

Instead, these models show us what is most statistically familiar. The result is usually objectively sound, or at least close enough that we can rewrite it as something that is.


I don't use the best available models for prototyping because it can be expensive or more time consuming. This innovation makes prototyping faster and practicing prompts on slightly lower accuracy models can provide more realistic expectations.


The excitement for me is the implications for lower-energy models. Tech like this could thoroughly break the Nvidia stranglehold, at least for some segments.


Choosing a certain type of economic theory or having certain sectors of the economy do better than others is 100% politics. I don’t think there is an economic theory where everybody benefits equally around the same time without any downsides.


How the heck is it not? Computers are looking at screenshots and searching the internet to support their "thinking"; that's amazing! Have we become so used to AI that what was impossible 6 months ago is shruggable today?

I've been doing this MIND-DASH diet lately, and it's amazing that I can just take a picture of whatever (nutritional info / ingredients are perfect for that) and ask if it fits my plan, and it tells me which bucket it falls into, with a detailed breakdown of macros in support of some additional goals I have (muscle building for powerlifting). It's amazing! And it does passively in 2 minutes what would take me 5-10 minutes of active searching.


I fully expect that someday the news will announce, "The AI appears to be dismantling the moons of Jupiter and turning them into dense, exotic computational devices which it is launching into low solar orbit. We're not sure why. The AI refused to comment."

And someone will post, "Yeah, but that's just computer-aided design and manufacturing. It's not real AI."

The first rule of AI is that the goalposts always move. If a computer can do it, by definition, it isn't "real" AI. This will presumably continue to apply even as the Terminator kicks in the front door.


Yes, but I choose to interpret that as a good thing. It is good that progress is so swift and steady that we can afford to keep moving the goalposts.

Take cars as a random example: progress there isn't fast enough that we keep moving the goalposts for e.g. fuel economy. (At least not nearly as much.) A car with great fuel economy 20 years ago is still considered at least good in terms of fuel economy today.


And if you account for the makeup of the fleet on the road overall, a great fuel-economy car from 1995 (say, a Prizm) still beats the median vehicle on the road, which is certainly an SUV weighing twice as much and getting worse mileage.


In the same way, a calculator performing arithmetic faster than humans isn't impressive. In the same way, running a regex over a million lines and beating a human at search isn't impressive.


Neither is impressive solely because we've gotten used to them. Both were mind-blowing back in the day.

When it comes to AI - and LLMs in particular - there’s a large cohort of people who seem determined to jump straight from "impossible and will never happen in our lifetime" to "obvious and not impressive", without leaving any time to actually be impressed by the technological achievement. I find that pretty baffling.


I agree, but without removing search you cannot decouple the two. Has it embedded a regex-like method and is just leveraging that, or is it doing something more? Yes, even the regex is still impressive, but it is less impressive than doing something more complicated and understanding context in more depth.


I think both are very impressive, world shattering capabilities. Just because they have become normalized doesn't make it any less impressive in my view.


That's a fair point, and I would even agree. Though I think we could agree that it is fair to interpret "impressive" in this context as "surprising". There's lots of really unsurprising things that are incredibly impressive. But I think the general usage of the word here is more akin to surprisal.


Yeah, it's a funny take because this is in fact a more advanced form of AI with autonomous tool use that is just now emerging in 2025. You might say "They could search the web in 2024 too", but that wasn't autonomous on its own; it required being told to, or checking a box. This one is piecing ideas together like "Wait, I should Google for this", and that is specifically a new feature for OpenAI o3 that wasn't even in o1.

While it isn't entirely in the spirit of GeoGuessr, it is a good test of the capabilities, to the point where being great at GeoGuessr in fact becomes the lesser news here. It would still be a good test even with this feature disabled.


There’s another future where reasoning models get better with larger context windows, and you can throw a new programming language or framework at it and it will do a pretty good job.


I know it’s a marketing case study, but:

> Ever wondered how NASA identifies its top experts, forms high-performing teams, and plans for the skills of tomorrow?

Here's another resource on that: https://appel.nasa.gov/2010/02/18/aa_2-7_f_nasa_teams-html/, the book "How NASA Builds Teams: Mission Critical Soft Skills for Scientists, Engineers, and Project Teams".


https://www.axios.com/2025/04/19/inside-trump-mindset-tariff...

"We saw it in business with Trump," one adviser said. "He would have these meetings and everyone would agree, and then we would just pray that when he left the office and got on the elevator that the doorman wouldn't share his opinion, because there would be a 50/50 chance [Trump] would suddenly side with the doorman."


Trump is known to be that type where if you want him to go with your suggestion, you must be the last person in the room.


Perhaps that's why Elon was following him so closely everywhere at the tail end of his campaign and at the start of his presidency.


Yes, those were my thoughts at the end of the article. If AI coding is really good (or will be really, really good), you could give a six-figure salary + $5/d in OpenAI credits to a Bay Area developer, OR a $5/d salary + $5/d in OpenAI credits to someone else in another country.

That's what happened to manufacturing after all.


A salary of 150 dollars/month won't get you anyone from any country, and if it somehow does, that person will have so many things to figure out (war, hunger, political instability) that they would obviously not be productive.


Thing is, manufacturing physical goods means you have to physically move them around. Digital goods don't have that problem. Time zones are what's proving challenging, though.


100%. You can offshore "please write code doing X for me" but it's much harder to offshore "please generate value for my customers with this codebase" which is a lot closer to what software engineers actually do.

Therefore, I do not anticipate a massive offshoring of software like what happened in manufacturing. Yet, a lot of software work can be fully specified and will be outsourced.


aka https://en.wikipedia.org/wiki/No_Silver_Bullet

And it's also interesting to think that PMs are also using AI. In my company, for example, we allow users to submit feedback, and an AI summary report is sent to PMs, who then put the report into ChatGPT along with the organizational goals, the key players, and previous meeting transcripts, and ask the AI to weave everything together into a PRD, or even a 10-slide presentation.

