
> GPT-4 is pretty good at Math, nowhere near "falling apart".

It's good at tasks that were included in the training dataset in some variation.



I have played around with GPT-4 and some fairly simple but completely new math ideas. It was fabulous at identifying special cases I had overlooked that disproved my conjectures.


Example?


I was playing around with prime numbers and simple, made-up relationships between them, such as the relationship between the square of a prime N and the set of primes smaller than N, etc.

It caught me out with specific examples that violated my conjectures. In one case the conjecture held for all but one value; another conjecture was generally true but failed for 2 and 3.

In one case it thought a conjecture I made was wrong, and I had to push it to think through why it thought it was wrong until it realized the conjecture was right. As soon as it had its epiphany, it corrected all its logic around that concept.

It was very simple stuff, but an interesting exercise.
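
For concreteness, here is a minimal sketch of the kind of check involved. The conjecture shown ("p^2 mod 24 = 1 for every prime p") is just a stand-in in the same spirit, not one of my actual conjectures; it happens to fail only for 2 and 3.

    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n**0.5) + 1))

    # Check the stand-in conjecture "p^2 mod 24 = 1" for all primes below 1000.
    counterexamples = [p for p in range(2, 1000) if is_prime(p) and p**2 % 24 != 1]
    print(counterexamples)  # prints [2, 3]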

The part I enjoyed the most was seeing GPT-4's understanding move and change as we pushed back on each other's views. You miss out on that impressive aspect of GPT-4 in simpler sessions.


Have you tried formalizing your ideas with Isabelle? It has a constraint solver and will often find counterexamples to false arithmetical propositions[1].

1: https://isabelle.in.tum.de/overview.html


I have not been able to figure out how that would help in the context of this discussion. As I see it, what’s very interesting here is that an LLM is able to do this.


I think the point is that an LLM is not the right tool for deep reasoning, and Isabelle and similar systems are much better tools for it, even though the community is trying to apply LLMs in this area following the current wave of hype.


Curious why you pointed specifically to Isabelle, which looks ancient and over-engineered; there are many other tools and languages in this area.

I'm not criticizing, just curious about your opinion.


Isabelle is good at counterexamples in ways few other proof assistants are. In general its automation is excellent, partly because it uses a less powerful logic (HOL instead of CIC; more expressive logics are harder to write automation for). It's not obsolete.


I have not, thanks for the tip.


It's hard to judge how deep and unique your conjectures were.

I did similar testing of GPT-4, and my observation is that it starts failing after 3-4 levels of reasoning depth.


Nice to see the number of levels of reasoning depth mentioned. I personally believe the size of a (well-trained) LLM determines how many steps of reasoning in sequence it can approximate. Newer models get deeper and deeper, giving them deeper reasoning context windows. My hypothesis is that you don't need infinite reasoning depth, just a bit more than GPT-4 has. I think once you can tie your output together with thinking in terms of ~10+ reasoning steps, you'll be very close to human performance.


> failing after 3-4 levels of reasoning depth.

That sounds more like an implementation or resource limitation, rather than an inherent limitation of the technique in general.


It is not obvious to me how you came to that conclusion.

LLMs have received enormous investment: tens of billions of dollars and huge amounts of compute, maybe more than any other technology in history, and they still can't crack three steps of reasoning. That sounds like a limitation of the technology.


None of these systems or their training sets have been specifically tailored to tackle abstract reasoning or math, so that seems like a premature conclusion. The fact that they're decent at programming despite that is interesting.


They're also brand new and at some undetermined point on the sigmoid curve. Trying to predict where you are on the curve while in the middle of a sigmoid is a fool's errand; the best you can do is make random predictions and hope you are accidentally correct so you can become a pundit later.


It's kinda able to do some math tasks some of the time, whereas you can use techniques from an arithmetic textbook to get the right answer all of the time with millions of times less CPU, even including the overhead of round-tripping to ASCII numerals, which is shockingly large compared to what a multiply costs.

Kinda "the problem" with LLMs is that they successfully seduce people by seeming to get the right answer to anything 80% of the time.


The arithmetic issues are well documented and understood; they're a problem of sub-token manipulation, which has nothing to do with reasoning. (Similar to calling blind people unintelligent because they can't read the IQ test.)

And the better LLMs can easily write code to do the arithmetic that they suck at...
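
To make the sub-token point concrete, here is a quick sketch using the tiktoken library (my choice of tool for illustration; the exact token boundaries depend on the model's tokenizer). A number gets split into arbitrary multi-digit chunks, so the model never sees individual digits the way a grade-school algorithm would.

    # Sketch: how a GPT-style tokenizer chunks a number. Token boundaries vary
    # by tokenizer; the point is only that the model works on chunks, not digits.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("123456789 * 987654321")
    print([enc.decode([t]) for t in tokens])  # e.g. ['123', '456', '789', ' *', ...]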


Excellent analogy. LLMs are capable of many extraordinary things, and it's a shame people dismiss them because they fail to live up to some specific test they invented.


Math is a lot more than just arithmetic.


Yeah but if you can only do arithmetic right X% of the time you aren't going to get other answers right as often as would really be useful.

That said, LLMs have a magic ability to "short circuit" and get the right answer despite not being able to get the steps right. I remember scoping out designs for NLP systems about 5 years ago and frequently concluding "that won't work" because information was lost at an early stage. In retrospect, by short-circuiting, a system like that can outperform its parts, but it still faces a ceiling on how accurate the answers are because the reasoning is not sound.


Human reasoning is amazingly not sound.

When you add in various patterns, double-checks, and memorized previous results, what human reasoning can do is astounding. But it is very, very far from sound.


All currently available reasoning approaches are limited.

I guess the topic is how far GPT's reasoning is from a human's. We can apply some simple tests:

- Can GPT play chess as well as humans, as a benchmark reasoning game?

- Has GPT proved any nontrivial math theorems, or solved any math problems where humans haven't yet found a solution?


One thing I thought was amusing was the burst of articles about Cyc that appeared when Doug Lenat died, including this arXiv paper

https://arxiv.org/abs/2308.04445

and that one said that Cyc had over 1,100 special-purpose reasoning engines. The general-purpose resolution solver was nowhere near fast enough to be really useful.

Early on there was

https://en.wikipedia.org/wiki/General_Problem_Solver

which would in principle be capable of finding a winning move in a chess position, but because it worked by exhaustive search it would take far too long in practice. The thing is that a good chess-playing program is not generally intelligent, just as a chess grandmaster isn't necessarily good at anything other than chess; it just has special-purpose heuristics (as opposed to algorithms) that find good chess moves.

ChatGPT-like systems will be greatly improved by coupling them to other systems such as "write a Python/SQL script then run it", "run a query against Bing and summarize the results", and "go find the chess engine and ask it what move to make". That is, like Cyc, it will get a Swiss Army knife of tools that help it do things it's not good at, but it doesn't create general intelligence any more than Cyc did.
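
A purely illustrative sketch of that coupling; the tool names and the hard-coded routing below are hypothetical, not any real ChatGPT or Cyc interface.

    # Toy illustration of the "Swiss Army knife" idea: route sub-tasks to
    # specialized tools instead of having the language model answer directly.
    def run_python(expression: str) -> str:
        # stand-in for "write a Python script then run it"
        return str(eval(expression, {"__builtins__": {}}))

    def chess_engine_move(fen: str) -> str:
        # stand-in for "go find the chess engine and ask it what move to make"
        return "e2e4"

    TOOLS = {"python": run_python, "chess": chess_engine_move}

    def answer(tool: str, payload: str) -> str:
        # in a real system the model would choose the tool; here the caller does
        return TOOLS[tool](payload)

    print(answer("python", "123456789 * 987654321"))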

Roger Penrose, in The Emperor's New Mind, suggests that there must be some quantum magic in the human mind, because the human mind is able to solve any math problem whereas any machine is limited by Gödel's theorem. It's silly, however, because humans clearly can't prove every theorem: look at how we struggled with Fermat's Last Theorem for nearly 360 years, or how

https://en.wikipedia.org/wiki/Collatz_conjecture

doesn't even seem tantalizingly close to being solved.
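
To show how little machinery is involved, here is a minimal sketch of the Collatz iteration; checking any one starting value is trivial, and the open problem is proving the loop terminates for every starting value.

    # Collatz: halve even numbers, map odd n to 3n + 1. The conjecture says this
    # always reaches 1. Verifying one number is easy; proving it for all is open.
    def collatz_steps(n: int) -> int:
        steps = 0
        while n != 1:
            n = n // 2 if n % 2 == 0 else 3 * n + 1
            steps += 1
        return steps

    print(collatz_steps(27))  # 111 steps before reaching 1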

The difference might be that humans feel bad when they get the wrong answer, whereas ChatGPT certainly doesn't (as much as its empty apologies can be satisfying to people). This isn't just an attribute of humans: working with other animals such as horses, I'm convinced that they feel bad when they screw up too.


> it will get a Swiss Army knife of tools that help it do things it's not good at, but it doesn't create general intelligence any more than Cyc did

How do you know general intelligence is its own thing and not just a Swiss army knife of tools?

> because the human mind is able to solve any math problem whereas any machine is limited by Gödel's theorem

Any machine can be programmed to solve any problem at all, if the proof system is inconsistent. Which is probably exactly the case with humans. We work around it because different humans have different inconsistencies, so checking each other's work is how we average out those inconsistencies.


(As a person who went down the rabbit hole of knowledge-based systems and looked at Cyc quite a bit.)

Three forms of intelligence are (i) animal intelligence, (ii) language use, and (iii) abstract thinking.

Animals are intelligent in their own way, particularly socially intelligent. My wife runs a riding barn and it is clear to me that one of the things horses are most interested in is what the people and other horses are up to, and that a horse doesn't just have an opinion about other horses; it has an opinion about what the other horses think about a given horse. (e.g. Cyc has a system of microtheories and modalized logic that tries to get at this. Of course visual recognition and similar things are a big part of animal intelligence, and boy have neural nets made progress there.)

Language is a unique capability of humans. (which Cyc made no real contribution to.)

If you get a PhD, what you learn is how to develop systems of abstract thinking, or at the very least to go to conferences and acquire them, or to dig through the literature, dust them off, and get them working. There is the aspect of individual creativity, but also the "standing on the shoulders of giants" that Newton talked about.

Before Lenat started on Cyc he was interested in expert systems for building expert systems, or at the very least a set of development tools for doing the same, and that was a motivation for Cyc, even if the point of Cyc was to produce new knowledge bases and reasoning procedures that would live inside Cyc. The trouble is that this was a tortuous procedure. I did go through a phase of thinking about evaluating OpenCyc for a project, but it had the problem that it would have taken at least six months just to get started with a project that could be finished some other way much more quickly.

My own journey led through twists and turns, but I came to see it as something like systems software development, where you build tools like compilers and debuggers that transform inputs into a knowledge base and put it to work. I very much gave up on "embedding in itself".

As for problems in general, I don't really know if they can all be solved. Isn't it possible that there is no finite procedure to prove the Collatz conjecture?


> Language is a unique capability of humans.

No, it's not. Language is well documented in dolphins, for instance. Crows have also demonstrated self-awareness and the ability to do arithmetic. I think your three-part breakdown of intelligence is out of date. There's no rigorous evidence that intelligence breaks down in this way; it's just a "folk theory" at this point.


According to Gödel's incompleteness theorem, some truths aren't provable within a given formal system, no matter how much reasoning and logic you apply inside it.


No it's just pretty good in general lol.


my experience is that it's pretty subpar


Are you using code interpreter? It's better.

The mobile app doesn't offer it though, and also has a system prompt that causes some strange behavior - sometimes it will put emojis in the text and then apologize for using emojis.


Care to share a GPT conversation you’ve had? I’m interested in what sorts of prompts lead you to this opinion. My experience is the opposite.


A bit too much of a hassle. But if you're willing to share some of your good experiences, I'm curious




