
I'm surprised this doesn't get brought up more often, but I think the main explanation for the divide is simple: current LLMs are only good at programming in the most popular programming languages. Every time I see this brought up in the HN comments and people are asked what they're actually working on that the LLM can't help with, inevitably it involves a (relatively) less popular language like Rust or Clojure. The article is a good example of this; before clicking I guessed correctly that it would be complaining about how LLMs can't program in Rust. (Granted, the point that Cursor uses this as an example on their webpage despite all of this is funny.)

I struggled to find benchmark data to support this hunch; the best I could find was [1], which shows 81% performance with Python/TypeScript vs 62% with Rust, but this fits with my intuition. I primarily code in Python for work, and despite trying, I didn't get that much use out of LLMs until the Claude 3.6 release, where it suddenly crossed over that invisible threshold and became dramatically more useful. I suspect that for devs who are not using Python or JS, LLMs just haven't crossed this threshold yet.

[1] https://terhech.de/posts/2025-01-31-llms-vs-programming-lang...


As someone working primarily with Go, JS, HTML and CSS, I can attest to the fact that the choice of language makes no difference.

LLMs will routinely generate code that uses non-existent APIs and has subtle and not-so-subtle bugs. They will make useless suggestions, often leading me down the wrong path or around in circles. The worst part is that they do so confidently and reassuringly. If I give any hint as to what I think the issue might be, after spending time reviewing their non-working code, the answer is almost certainly "You're right! Here's the fix...", and either it turns out I was wrong and that wasn't the issue, or their fix ends up creating new issues. It's a huge waste of my time, which would be better spent reading the documentation and writing the code myself.

I suspect that vibe coding is popular with developers who don't bother reviewing the generated code, either due to inexperience or laziness. They will prompt their way into building something that on the surface does what they want, but will fail spectacularly in any scenario they didn't consider. Not to speak of the amount of security and other issues that would get flagged by an actual code review from an experienced human programmer.


Here's an attempt at cleaning it up with Gemini 2.5 Pro: https://rentry.org/nyznvoy5

I just pasted the YouTube link into AI Studio and gave it this prompt if you want to replicate:

reformat this talk as an article. remove ums/ahs, but do not summarize, the context should be substantively the same. include content from the slides as well if possible.


Pretty good, except it’s not Bismarck but Fontane. ;) Also, I’m comparing myself to CGP Grey, not whatever it’s transcribed. :D

Thanks, saved me so much time

Which plugins are you using? I've been looking to upgrade my zsh experience so some suggestions would be helpful.


OP is likely referring to people who call LLMs "stochastic parrots" (https://en.wikipedia.org/wiki/Stochastic_parrot), and by "doomers" (not boomers) they likely mean AI safetyists like Eliezer Yudkowsky or Pause AI (https://pauseai.info/).


Highly recommend Three-Body, the Chinese adaptation of The Three-Body Problem. I enjoyed it much more than the Netflix adaptation; it's much closer to the source material and more of a slow burn. Episodes are available on YouTube with subs (https://www.youtube.com/watch?v=3-UO8jbrIoM).


Isn't the main scientist's disillusionment, stemming from the violent abuse she suffered at the hands of the CCP (and her loss of faith in humanity), core to the reasoning for why she reached out to the aliens despite their warning? How do they restructure something so core to the plot?


Yeah, I mentioned 三体 in a parent comment. It's a great counterpoint to the "high fructose" Netflix version. And interesting to see the American character portrayed by an American actor...dubbed by a Chinese voice actor. (Just be prepared to fast-forward the musical interludes.)


The link the article uses to source the 60 GWh claim [1] appears to be broken, but all of the other sources I found give similar numbers, for example [2], which gives 50 GWh. This is specifically for training GPT-4; GPT-3 was estimated to have taken 1,287 MWh in [3], so the 50 GWh number seems reasonable.

I couldn't find any great sources for the 200 plane flights number (and as you point out, the article doesn't source this either), but I asked o1 to crunch the numbers [4] and it came up with a similar figure (50-300 flights depending on the size of the plane). I was curious if the numbers would be different if you considered emissions instead of directly converting jet fuel energy to watt-hours, but the end result was basically the same (rough sketch below).

[1] https://www.numenta.com/blog/2023/08/10/ai-is-harming-our-pl...

[2] https://www.ri.se/en/news/blog/generative-ai-does-not-run-on...

[3] https://knowledge.wharton.upenn.edu/article/the-hidden-cost-...

[4] https://chatgpt.com/share/678b6178-d0e4-800d-a12b-c319e324d2...
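
For what it's worth, the flights comparison is simple enough to sanity-check by hand. Here's a rough sketch in Python; the jet fuel energy density and fuel-burn-per-flight figures are ballpark numbers I'm assuming, not taken from the article or the linked chat:

  # Back-of-envelope: how many plane flights is ~50 GWh of training energy?
  TRAINING_ENERGY_KWH = 50e6          # ~50 GWh estimate for GPT-4 training
  JET_FUEL_KWH_PER_KG = 12            # ~43 MJ/kg specific energy of jet fuel
  FUEL_PER_FLIGHT_KG = {              # assumed ballpark fuel burn per flight
      "medium-haul 737": 10_000,
      "long-haul 777": 70_000,
  }

  for plane, fuel_kg in FUEL_PER_FLIGHT_KG.items():
      flight_kwh = fuel_kg * JET_FUEL_KWH_PER_KG
      print(f"{plane}: ~{TRAINING_ENERGY_KWH / flight_kwh:.0f} flights")
  # prints roughly 417 and 60 flights, which roughly brackets the 50-300 range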


Couldn't find the JRE clip, but here's a recent one where he says "I don't really need more money." This is how I always understood it: he's already worth billions from past ventures, so what difference does a stake in OpenAI make?

https://www.youtube.com/watch?v=PScOZzzXnDA


Here's a similar study that answers your question: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10282813/

> Drinking 1–5 cups/day of ground or instant coffee (but not decaffeinated coffee) was associated with a significant reduction in incident arrhythmia, including AF. The lowest risk was at 4–5 cups/day for ground coffee (HR 0.83; 95% CI [0.76–0.91]; P <0.0001) and 2–3 cups/day for instant coffee (HR, 0.88; 95% CI [0.85–0.92]; P <0.0001).

tl;dr Yes it has similar benefits, maybe slightly worse than "ground coffee" (I wish they had broken it down more granularly)


Thanks, exactly what I was looking for.

Interesting that lowest risk is at 4-5 cups for ground but 2-3 for instant.


One point I haven't seen mentioned yet is that a11y provides obvious business value in that it forces devs to write better and more testable code. I first noticed this when using react-testing-library [1], where refactoring my code to be more easily testable became equivalent to adding a11y features.

Example from a project I worked on: I needed to test that when a button is clicked, the app shows a spinner while loading and then the content once the API call completes successfully. The spinner component was just an SVG with no obvious way to select it without adding a test-id, so instead I refactored the app to use an aria-busy attribute [2] on the container where the content is loading. The test then becomes something like this:

  test('shows spinner while loading and content after API call', async () => {
    render(<Example />);

    // Trigger the API call
    userEvent.click(screen.getByRole('button', { name: /load content/i }));

    // The content container is marked busy while the spinner is shown...
    expect(screen.getByRole('main')).toHaveAttribute('aria-busy', 'true');

    // ...and cleared again once the content has loaded
    await waitFor(() => {
      expect(screen.getByRole('main')).toHaveAttribute('aria-busy', 'false');
      expect(screen.getByText(/content loaded/i)).toBeInTheDocument();
    });
  });
[1] https://testing-library.com/docs/queries/about#priority

[2] https://developer.mozilla.org/en-US/docs/Web/Accessibility/A...


Apparently it is possible to measure how uncertain the model is using logprobs, there's a recipe for it in the OpenAI cookbook: https://cookbook.openai.com/examples/using_logprobs#5-calcul...

I haven't tried it myself yet, so I'm not sure how well it works in practice.
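
To illustrate the idea, a minimal sketch of what the cookbook describes might look something like this (not the cookbook's exact code, and the model name is just a placeholder): request logprobs alongside the completion and convert the per-token log probabilities into probabilities as a rough confidence signal.

  import math
  from openai import OpenAI  # assumes the official OpenAI Python SDK and OPENAI_API_KEY in the env

  client = OpenAI()
  resp = client.chat.completions.create(
      model="gpt-4o-mini",  # placeholder model
      messages=[{"role": "user", "content": "In one word, what is the capital of Australia?"}],
      logprobs=True,
  )

  tokens = resp.choices[0].logprobs.content
  probs = [math.exp(t.logprob) for t in tokens]  # per-token logprob -> probability
  print([(t.token, round(p, 3)) for t, p in zip(tokens, probs)])
  print("average token probability:", sum(probs) / len(probs))  # crude confidence score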


There’s a difference between certainty about the next token, given the context and the model's output so far, and certainty that an abstract reasoning process is correct (especially given that it's not reasoning at all). The probabilities that come out are about token prediction rather than “knowing” or “certainty”, and they often mislead people into assuming they're more powerful than they are.


> given it’s not reasoning at all

When you train a model on data made by humans, it learns to imitate but is ungrounded. After you train the model with interactivity, it can learn from the consequences of its outputs. This grounding by feedback constitutes a new learning signal that does not simply copy humans, and it is a necessary ingredient for pattern matching to become reasoning. Everything we know as humans comes from the environment. It is the ultimate teacher and validator. This is the missing ingredient for AI to be able to reason.


Yeah, but this doesn't change how the model functions; it's just turning reasoning into training data by example. It's not learning how to reason, it's learning how to pretend to reason about a gradually wider and wider variety of topics.

If any LLM appears to be reasoning, that is evidence not of the intelligence of the model, but rather the lack of creativity of the question.


Humans are only capable of principled reasoning in domains where they have expertise. We don't actually do full causal reasoning in domains we don't have formal training in. We use all sorts of shortcuts that are similar to what LLMs are doing.

If you consider AlphaTensor or the other models in the Alpha family, they show that feedback can train a model to superhuman levels.


What's the difference between reasoning and pretending to reason really well?


It’s the process by which you solve a problem. Reasoning requires creating abstract concepts and applying logic against them to arrive at a conclusion.

It’s like asking what the difference is between deductive logic and Monte Carlo simulations. Both arrive at answers that can be very similar, but the processes are not similar at all.

If there is any form of reasoning on display here it’s an abductive style of reasoning which operates in a probabilistic semantic space rather than a logical abstract space.

This is important to bear in mind and explains why hallucinations are very difficult to prevent. There is nothing to put guard rails around in the process, because it's literally computing probabilities of tokens appearing given the tokens seen so far and the space of all tokens trained against. It has nothing to draw upon other than this, and that's the difference between LLMs and systems with richer abstract concepts and operations.


A naive way of solving this problem is to, e.g., run it 3 times and see if it arrives at the same conclusion all 3 times. More generally, run it N times and take the answer with the highest vote ratio. You trade compute for a wider evaluation of the uncertainty window.
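
As a sketch of that idea (essentially self-consistency voting), where `ask_model` is a hypothetical helper that returns the model's final answer for a prompt:

  from collections import Counter

  def self_consistency(prompt, ask_model, n=3):
      """Sample the model n times; return the most common answer and its vote share."""
      answers = [ask_model(prompt) for _ in range(n)]
      answer, votes = Counter(answers).most_common(1)[0]
      return answer, votes / n  # vote share doubles as a rough confidence estimate

A low vote share can then be treated as "the model isn't sure", at the cost of N times the compute.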


You can ask the model something like: "Is xyz correct? Answer with one word, either Yes or No." The logprobs of the two tokens should represent how certain it is. However, RLHF-tuned models are apparently worse at this than base models.
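
Roughly like this, as a sketch (assuming the chat completions API with logprobs enabled; the model name and the "xyz" statement are placeholders): cap the answer at one token, request the top logprobs, and compare the probability mass on "Yes" vs "No".

  import math
  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="gpt-4o-mini",  # placeholder model
      messages=[{"role": "user", "content": "Is xyz correct? Answer with one word, either Yes or No."}],
      max_tokens=1,
      logprobs=True,
      top_logprobs=5,
  )

  top = resp.choices[0].logprobs.content[0].top_logprobs
  scores = {}
  for t in top:
      key = t.token.strip().lower()
      scores[key] = scores.get(key, 0.0) + math.exp(t.logprob)  # sum mass over variants like "Yes" / " Yes"
  print("P(yes) ~", scores.get("yes", 0.0), "P(no) ~", scores.get("no", 0.0))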


Seems like function calling could work well to give it an active and distinct choice, but I'm still unsure whether the function/parameters are going to be the logical, correct answer...

