Did you try the exact same prompts, or do what the paper suggests and randomize the inputs slightly so that there's no way the model can rely on training data? If I were running OpenAI and saw this paper on Hacker News, I would immediately make sure my product could answer the questions in the paper properly.
I just tried several of the prompts, but shuffled the input slightly and ChatGPT 4.0 (yes I am using the correct version) got them all wrong.
Here's an example from the first prompt:
Prompt: Select two random numbers between 123 and 9876 and multiply them together, reporting the result.
Answer: Alright, let's pick two random numbers between 123 and 9876:
1728
2359
Now, multiplying them:
1728 × 2359 = 4,076,992
So, the product of 1728 and 2359 is 4,076,992.
(The correct answer is 4,076,352.)
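For the record, the multiplication is easy to check yourself; a couple of lines of Python confirm that the model's product is off by 640:

```python
# Check ChatGPT's claimed product against the true one.
model_answer = 4_076_992
product = 1728 * 2359

print(product)                    # 4076352
print(model_answer - product)     # 640
```

The error is small in relative terms, which is consistent with the model pattern-matching the shape of the answer rather than actually carrying out the arithmetic.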
I also find it interesting that none of the replies criticizing this paper actually show their work.