mNovak's comments

I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence".

I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.

Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.


Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators, and the fact that the token generators are somehow beating it anyway really says something.

The average ARC-AGI 2 score for a single human is around 60%.

"100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%."

https://arcprize.org/arc-agi/2/


Worth keeping in mind that in this case the test takers were random members of the general public. The score of e.g. people with bachelor's degrees in science and engineering would be significantly higher.

Random members of the public = average human beings. I thought those were already classified as General Intelligences.

Average human beings with average human problems.

What is the point of comparing performance of these tools to humans? Machines have been able to accomplish specific tasks better than humans since the industrial revolution. Yet we don't ascribe intelligence to a calculator.

None of these benchmarks prove these tools are intelligent, let alone generally intelligent. The hubris and grift are exhausting.


What's the point of denying or downplaying that we are seeing amazing and accelerating advancements in areas that many of us thought were impossible?

It can be reasonable to suspect that advances on benchmarks are only weakly, or even negatively, correlated with advances on real-world tasks. I.e., a huge jump on benchmarks might not be perceptible to 99% of users doing 99% of tasks, and some users might even notice degradation on specific tasks. This is especially the case when there is some reason to believe most benchmarks are being gamed.

Real-world use is what matters, in the end. I'd be surprised if a change this large doesn't translate to something noticeable in general, but the skepticism is not unreasonable here.


The GP comment is not skeptical of the jump in benchmark scores reported by one particular LLM. It's skeptical of machine intelligence in general, claims that there's no value in comparing machine performance with that of human beings, and accuses those who disagree with this take of "hubris and grift". That has nothing to do with any form of reasonable skepticism.

I would suggest it is a well-studied phenomenon with many forms; I'd guess mostly identity preservation. If you dislike AI from the start, it is generally a very strongly emotional view. I don't mean there is no good reason behind it; I mean it is deeply rooted in your psyche, very emotional.

People are incredibly unlikely to change those sorts of views, regardless of evidence. So you get this interesting outcome where they both viscerally hate AI and deny that it is in any way as good as people claim.

That won't change with evidence until it is literally impossible not to change.


> The hubris and grift are exhausting.

And moving the goalposts every few months isn't? What evidence of intelligence would satisfy you?

Personally, my biggest unsatisfied requirement is continual-learning capability, but it's clear we aren't too far from seeing that happen.


> What evidence of intelligence would satisfy you?

That is a loaded question. It presumes that we can agree on what intelligence is, and that we can measure it in a reliable way. It is akin to asking an atheist the same about God. The burden of proof is on the claimer.

The reality is that we can argue about that until we're blue in the face, and get nowhere.

In this case it would be more productive to talk about the practical tasks a pattern matching and generation machine can do, rather than how good it is at some obscure puzzle. The fact that it's better than humans at solving some problems is not particularly surprising, since computers have been better than humans at many tasks for decades. This new technology gives them broader capabilities, but ascribing human qualities to it and calling it intelligence is nothing but a marketing tactic that's making some people very rich.


(Shrug) Unless and until you provide us with your own definition of intelligence, I'd say the marketing people are as entitled to their opinion as you are.

I would say that marketing people have a motivation to make exaggerated claims, while the rest of us are trying to just come up with a definition that makes sense and helps us understand the world.

I'll give you some examples. "Unlimited" now has limits on it. "Lifetime" means only for so many years. "Fully autonomous" now means with the help of humans on occasion. These are all definitions that have been distorted by marketers, which IMO is deceptive and immoral.


> What evidence of intelligence would satisfy you?

Imposing world peace and/or exterminating homo sapiens


> Machines have been able to accomplish specific tasks...

Indeed, and the specific task machines are accomplishing now is intelligence. Not yet "better than human" (and certainly not better than every human) but getting closer.


> Indeed, and the specific task machines are accomplishing now is intelligence.

How so? This sentence, like most of this field, is making baseless claims that are more aspirational than true.

Maybe it would help if we could first agree on a definition of "intelligence", yet we don't have a reliable way of measuring that in living beings either.

If the people building and hyping this technology had any sense of modesty, they would present it as what it actually is: a large pattern matching and generation machine. This doesn't mean that this can't be very useful, perhaps generally so, but it's a huge stretch and an insult to living beings to call this intelligence.

But there's a great deal of money to be made on this idea we've been chasing for decades now, so here we are.


> Maybe it would help if we could first agree on a definition of "intelligence", yet we don't have a reliable way of measuring that in living beings either.

How about this specific definition of intelligence?

   Solve any task provided as text or images.

AGI would be to achieve that faster than an average human.

I still can't understand why they should be faster. Humans have general intelligence, afaik. It doesn't matter if it's fast or slow. A machine able to do what the average human can do (intelligence-wise) but 100 times slower still has general intelligence. Since it's artificial, it's AGI.

Wouldn't you deal with spatial reasoning by giving the model access to a tool that structures the space in a way it can understand, or to a sub-model that does the spatial reasoning itself? These "general" models would serve as the frontal cortex while other models do specialized work. What is missing?
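A rough sketch of the kind of structuring tool I have in mind (the grid format here is hypothetical, not the actual ARC-AGI schema): flatten the 2D grid into explicit coordinate facts the text model can reason over.

    # Hypothetical helper: turn a 2D color grid into (row, col, color)
    # facts, giving a text-only model explicit spatial structure.
    def grid_to_facts(grid):
        return [(r, c, color)
                for r, row in enumerate(grid)
                for c, color in enumerate(row)]

    example = [[0, 1],
               [1, 0]]
    print(grid_to_facts(example))
    # [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]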

That's a bit like saying just give blind people cameras so they can see.

I mean, no not really. These models can see, you're giving them eyes to connect to that part of their brain.

They should train more on sports commentary, perhaps that could give spatial reasoning a boost.

Large murals on, for example, commercial buildings or residences are typically commissioned. These are big enough to require scaffolding/lifts and take multiple days to paint; with some exceptions (vacant property) it'd be hard to pull that off without the owner calling the cops. The building owner is paying them for the mural, or in some cases there's city grants or arts council projects.

Lots of muralists document the art/business on YouTube! Two I like: Kiptoe and SmoeNova


Very cool! A couple notes from my first few sols:

- I had a really hard time building a greenhouse, because I hadn't realized it'd be bigger than 1 square like all the previous buildings, and it just wouldn't build despite having materials etc. Maybe a footprint outline while hovering a build option?

- There were a lot of instructions from Dr. Kimura right off the bat. Hard for me to remember all that, and I was hoping talking to the doc again would replay those hints.

- My population seems to be stuck at 2. I have landing pads and habitats and plenty of food etc., but don't really know what I should be doing next.

- That the menus continue beyond the first couple of lines was not obvious to me. Possibly because I'm on a laptop, so the existing hint was way off to the right.


Got it, thanks for all this. There seems to be a bit of a bug with new colonists arriving so trying to fix that now. And agreed on your other points as well.

Maybe to allow sub-vocalized commands when wearing airpods, for example? I think this was a theme in the later Ender's Game series books.


I'm going to call BS on that chart of "AI-driven chip design". What "AI" tools has Cadence been providing since 2021 that account for 40-50% of "chip design" (whatever that even means)? Is "AI" here just any old algorithmic auto-router? Or a fuzzy search of the IP library?


So when can an AI call up the cable company and negotiate a discount? Asking for a friend.

But seriously, other tasks I've encountered recently that I wish I could delegate to an AI:

- Posting my junk to Craigslist, determining a fair price, negotiating with a buyer (pickup only!)

- Scheduling showings to find an apartment, wherein the listing agents are spread over multiple platforms, proprietary websites, or phone contacts

- Job applications -- not forging a resume, but compiling candidate positions with reasoning, and handling the tedious part where you have to re-enter your whole resume into their proprietary application pipeline app

What strikes me as the basic similarity across these tasks is that they are essentially data-entry jobs that interact with third-party interfaces, carry CRM-like follow-up requirements, and require "good judgement" (reading reviews, identifying scams, etc.).


This seems unlikely to occur if prompt injection remains possible. I'll just have my counterparty AI prompt-inject yours to negotiate a better deal on my behalf.


Bloat the apps, push users toward a high-tier iPhone, and some % of users settle for a more affordable Pixel instead. Not that Android apps are that much better.


I agree that a worker who becomes more productive thanks to some capital equipment generally won't see any benefit from it (unless operating it requires special skills). But I think the argument is that the end consumer will eventually benefit from the increased productivity.

In your example, yes, the factory owner can take their $100/hr of profit. But among the various factories, some owner might take $25 of that profit and instead undercut their competition to grow their order book. Other factories respond in kind, and the consumer is getting cheaper products.


Doesn't OpenRouter prove that inference is profitable? Why would random third parties subsidize the service for other random people online? Unless you're saying that only large frontier models are unprofitable, which I still don't think is the case but is harder to prove.


While this example explicitly asks for a port (thus a copy), I also find that LLMs' default behavior in general is to spit out new code from their vast pre-trained encyclopedia, rather than add an import for some library that already serves the purpose.
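A toy illustration of that default (python-slugify is a real package; the scenario is hypothetical): asked to slugify strings, a model will often emit its own helper from pretraining rather than import the existing library.

    import re

    # What an LLM will typically generate inline...
    def slugify(text):
        text = re.sub(r"[^a-z0-9]+", "-", text.lower())
        return text.strip("-")

    # ...instead of the one-line alternative:
    #   from slugify import slugify  # python-slugify, already on PyPI
    print(slugify("Hello, World!"))  # hello-world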

I'm curious if this will implicitly drive a shift in the usage of packages / libraries broadly, and if others think this is a good or bad thing. Maybe it cuts down the surface of upstream supply-chain attacks?


As a corollary, it might also increase the surface of upstream supply-chain attacks (patched or not)

The package import thing seems like a red herring


It's going to be fun if someone finds a security vulnerability in a commonly-emitted-by-LLMs code pattern. That'll be a lot harder to remediate than "Update dependency xyz"


> if someone finds a security vulnerability in a commonly-emitted-by-LLMs code pattern

how do you distinguish this from injecting a vulnerable dependency into a dependency list?


You can more easily check for known-vulnerable dependencies


Right, but if you can embed bad packages in LLMs, you can surely embed any kind of vulnerability imaginable.


I'm not thinking about deliberately embedded vulnerabilities, just accidental/emergent ones. The modern equivalent of devs copy-pasting stackoverflow answers that happen to contain SQL injection vulns.
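A concrete, hypothetical example of such a pattern, in the SQL-injection vein (sqlite3 is from Python's standard library):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")

    def find_user_vulnerable(name):
        # injectable: name = "x' OR '1'='1" matches every row
        return conn.execute(
            f"SELECT * FROM users WHERE name = '{name}'").fetchall()

    def find_user_safe(name):
        # parameterized query; the driver handles escaping
        return conn.execute(
            "SELECT * FROM users WHERE name = ?", (name,)).fetchall()

If thousands of codebases end up with near-identical copies of the first version, there's no single dependency to bump.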


Does the distinction make any difference?


Yes, you'd take different actions to avoid each.

